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Department of Electrical and Computer Engineering 
University of Illinois, Urbana-Champaign, 1989 

Computer vision has been regarded as one of the most complex and computationally intensive 
problems. An integrated vision system (IVS) is a system that uses vision algorithms from all levels 
of processing to perform for a high level application (e.g, object recognition). An IVS normally 
involves algorithms from low level, intermediate level and high level vision. Designing parallel 
architectures for vision systems has been of a tremendous interest to researchers. This thesis 
addresses several issues in parallel architectures and parallel algorithms for integrated vision systems. 

First, a model of computation for IVSs is presented. The model captures computational require- 
ments, defines spatial and temporal data dependencies between tasks, and shows what types of 
interactions may occur between tasks from different levels of processing. The model is used to 
develop features and capabilities of a parallel architecture suitable for IVSs. It is concluded that an 
architecture for IVS must be reconfigurable into different modes, be partitionable, allow dynamic 
resource allocation and task scheduling, provide flexible and fast communication between processing 
elements, provide efficient I/O and be fault-tolerant. 

A multiprocessor architecture for IVSs (called NETRA) is presented. NETRA is highly flexible 
without the use of complex interconnection schemes. NETRA is recursively defined hierarchical 
architecture whose leaf nodes consist of clusters processors connected with a programmable crossbar 
with a selective broadcast capability. Hence, it is easily scalable from small to large systems. Homo- 
geneity of NETRA permits fault tolerance and graceful degradation under faults. Several refinements 
in the architecture over the original design are also proposed. 



Performance of several vision algorithms when they are mapped on one cluster is presented. It 
is shown that SIMD, MIMD and systolic algorithms can be easily mapped onto processor clusters, 
and almost linear speedups are possible. For some algorithms, analytical performance results are 
compared with those obtained using an implementation. It is observed that the analysis is very accu- 
rate. 

An extensive analysis of inter-cluster communication strategies in NETRA is presented. A 
methodology to evaluate performance of algorithms on NETRA is described. Performance analysis 
of parallel algorithms when mapped across clusters is presented. The parameters are derived from the 
characteristics of the parallel algorithms, which are then, used to evaluate the alternative communica- 
tion strategies in NETRA. The effects of communication interference on the performance of algo- 
rithms are studied. It is observed that if communication speeds are matched with the computation 
speeds, almost linear speedups are possible when algorithms are mapped across clusters. 

Finally, several techniques to perform data decomposition, and static and dynamic load balanc- 
ing for IVS algorithms are described. These techniques can be used to perform load balancing for 
intermediate and high level, data dependent vision algorithms. These techniques are novel because 
they use knowledge about the data when it is produced and use knowledge about the computation in 
the next task to perform load balancing in an integrated environment. They are shown to perform 
well, using them on an implementation of a motion estimation system on a hypercube multiprocessor. 

Index Terms - Multiprocessor Architectures, Parallel Algorithms, Parallel Processing, Computer 
Vision, Image Processing 
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CHAPTER 1. 
INTRODUCTION 


1.1. Motivation and Scope 

One of the most important, difficult and computationally intensive problems in the 
field of artificial intelligence is computer vision. There is no consensus today on the 
definition and scope of computer vision. The problem of artificial vision is as old as the 
field of computer science and engineering. Researchers have devoted much time in 
attempting to define and solve parts of the problems for many years. However, to say that 
computer vision is in its infancy today is a correct judgment of the state of the art in 
artificial vision. Furthermore, nobody knows the answer to the question of whether it is 
possible to make artificial vision as powerful and general as human vision. One of the 
many reasons for not knowing the answer is that little is understood about human vision 
itself. 

There are several approaches to tackling the computational problems in computer 
vision. One of the approaches, which is also the oldest, is to use the computational 
powers of computers and their development in various fields of computer science and 
engineering, such as signal processing, mathematical and scientific algorithms, and graph 
theory. The other approach, which is relatively recent, is to somehow mimic the compu- 
tations performed in the human brain. This approach is termed as the neural network 
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approach. However, tremendous computational power in one form or another is needed 
in both the approaches. 

Computer vision and image understanding algorithms employ a very broad spec- 
trum of techniques from several areas such as signal processing, advanced mathematics, 
graph theory, and artificial intelligence. The computational requirements to perform algo- 
rithms from these fields are tremendous when executed individually, and when they need 
to be integrated in a meaningful way to perform a broader function in a reasonable 
amount of time, the computation becomes almost intractable. For example, consider 
interpretation of a changing scene at 30 frames per second. The amount of data to be han- 
dled per second itself is almost 25 Mbytes (million bytes) assuming a moderate resolution 
of 512x512 pixels per frame with each pixel of three bytes (one byte for each color and 
256 grey level). The amount of computation required for simple image transformations, 
labeling, grouping, surface reconstruction or motion analysis is very difficult to estimate; 
however, for many applications it can be in the range of 100 - 10,000 billion instructions 
per second [1]. This is raw processing power and does not include the complexities 
involved in a system such as interactions among various algorithms, input-output of data, 
managing system resources, and fault-tolerance. Therefore, the vision problem is of 
tremendous interest to computer architects and it presents them with great challenges. 

Having discussed that the need to provide tremendous processing power in an archi- 
tecture for computer vision, the next question is how can that processing power capabil- 
ity be provided? Parallel processing, which has progressed tremendously in the past 
decade, seems to be the consensus approach to providing the necessary computational 
power. Fortunately, most algorithms that are part of a vision system are in general, 
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characterized by massive parallelism. For low level processing, spatial decomposition of 
an image provides a natural way of generating parallel tasks. For higher level analysis 
operations, parallelization may also be based on other image characteristics and may be 
data dependent. In fact, parallel processing has been suggested as the approach to pro- 
vide computational power needed for most computational intensive problems such as 
scientific, vision or any other because technological limits are being reached in how fast 
a serial processor can perform. But the next question is what form of parallel processing, 
and what type of multiprocessor architectures are suitable for vision application? It may 
be easier to provide raw processing power by parallel processing, but the more important 
and difficult question is how to design multiprocessors so that the available processing 
power can be used efficiently. Since there is no consensus as to what a vision system con- 
sists of, another problem is how to evaluate or compare one architecture with another. 
Recently, efforts have been made to provide a framework and benchmark to evaluate 
multiprocessor architectures for vision which not only attempt to measure the processing 
power of an architecture but also test other architectural issues such as I/O, ability to per- 
form algorithms with varying characteristics, and effect on the performance due to 
interactions between tasks [1]. 

This thesis attempts to identify various issues in multiprocessor architectures and 
parallel algorithms for computer vision. The approach is to consider the computational 
requirements for vision in an integrated environment rather than to propose architectural 
solutions to perform one or more algorithms efficiently and fast We attempt to define the 
computational requirements for an Integrated Vision System (TVS), for which there is no 
general definition. However, an application dependent definition of an I VS is possible. 
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For example, object recognition, a system that takes an image (or a set of images) as 
input and produces an output that describes the object can be considered an IVS. How- 
ever, a system (or an algorithm) that takes an image input and produces its Discrete 
Fourier Transform (DFT) is not considered an Integrated Vision System, though comput- 
ing DFT itself may be a step or a part of an IVS. In fact, it is important to distinguish 
between image processing and computer vision (or IVS). Image processing involves 
transforming images by applying one or more algorithms to the input in order to make it 
more useful for interpretation by humans. For example, image enhancement, noise reduc- 
tion, scaling, and thresholding constitute image processing operations. Integrated Vision 
System, on the other hand, involves interpretation and recognition by the system itself 
using input data, parameters and knowledge base without any interference from humans. 
That is, the system is completely an automated vision system in the ideal case. There- 
fore, IVS can be defined as a system which employs a subset of vision algorithms in a 
systematic way to produce a meaningful output. The computational requirements for 
such an integrated vision system are tremendous [2]. 

Vision algorithms are normally divided into three levels: low level, intermediate 
level and high level. Low level algorithms are mostly image processing algorithms. 
These algorithms, in general, are very regular in structure, involve data independent and 
local computations, and involve pixel data. Available parallelism is normally on the 
pixel level. Intermediate level algorithms perform computations on the output produced 
by low level algorithms and involve more complex data structures, data dependent algo- 
rithms, symbolic processing, and involve varying degree of parallelism which itself 
depends on the data and the nature of the computation. Finally, high level algorithms not 


5 


only exhibit most of the properties of intermediate level algorithms but also involve top- 
down processing in which knowledge based interpretation is performed. Therefore, the 
algorithms may involve accessing databases, performing enormous searches and include 
other artificial intelligence algorithms. 

An Integrated Vision System will normally consist of algorithms from all levels of 
processing. Therefore, in addition to providing tremendous raw processing power an 
architecture must be capable of the following. It must have the ability to transform pixel 
data into a set of meaningful symbols that describe it, to process pixels, symbol data and 
other complex data structures in parallel, and the ability to simultaneously perform low, 
intermediate, and high level algorithms, and fast I/O. These requirements and others 
mean that an architecture must be reconfigurable, provide flexible and fast communica- 
tion structures between processing elements, provide different types of processing (such 
as SIMD, MIMD) to most efficiently execute algorithms from different levels of process- 
ing, be efficient in performing dynamic resource allocation and task scheduling, be parti- 
tionable into independent subsystems which can work on different computations simul- 
taneously, be fault-tolerant and provide fast I/O bandwidth to keep up with a tremendous 
amount of data flow. 

Design of multiprocessor architectures for IVSs, therefore, must address the require- 
ments posed by the above discussed characteristics of algorithms that are part of an IVS. 
In this thesis, we present a model of computation for IVSs for parallel processing. The 
model attempts to capture the properties of IVS algorithms, data flow and interactions 
between various tasks. Our model not only captures the computation requirements 
presented in the Image Understanding Benchmark presented in [1] but it also provides for 
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another dimension (time) of computation which is absent in the benchmark. Then we 
present an architecture for integrated vision systems called "NETRA." NETRA, in its ori- 
ginal form, was first proposed by Sharma, Patel and Ahuja in [3]. We propose several 
refinements to the architecture based on our understanding of computational require- 
ments for IVSs. An elaborate discussion is presented that gives the rationale behind the 
design. Several common vision algorithms are used to evaluate the performance of the 
architecture and alternative communication strategies. The algorithms are mapped using 
the multidimensional divide-and-conquer paradigm [4] which is an attractive mechanism 
for providing parallelism in all levels of processing. 

1.2. Review of Multiprocessor Architectures 

The advent of VLSI technology has enabled architects to produce high performance 
chips to perform specific applications. But these special purpose chips can only be used 
in an IVS as accelerators of specific algorithms. Another use of VLSI technology has 
been to create massively parallel Single Instruction Multiple Data (SIMD) processors for 
vision and other applications. There are also Multiple Instruction Multiple Data (MIMD) 
processors in which the number of processors are normally a few orders of magnitude 
less than that in SIMD (massively parallel) machines’, however, each processor is a 
powerful general purpose processor with its own program and data memory. Normally, 
MIMD machines fall into two categories: shared memory and distributed memory 
machines, though many architectures exhibit both paradigms. Within these 
classifications, multiprocessors are distinguished according to the interconnection topol- 
ogy between processors or processor-memories. Finally, there are systolic arrays, 
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hierarchical, partitionable multiprocessor architectures that have been proposed and stu- 
died for vision applications. Many of the architectures have been proposed just for vision 
application but almost all multiprocessors have been studied for vision applications. In 
the following discussion, we examine many of the architectures, describe the topology, 
salient features and limitations, and discuss their advantages and disadvantages with 
respect to solving vision problems. We use the topology as the main classification of the 
architectures in describing them; however, within a topology, if machines exist that can 
be further classified, then we will present a discussion on them. 

1.2.1. Mesh connected computers 

Mesh connected multiprocessors have been one of the first multiprocessors pro- 
posed for computer vision and image processing applications. For image processing 
applications, meshes seem to be an obvious choice because the images map quite natur- 
ally onto its structure. Figure 1.1 shows the topology of a mesh-connected computer. A 
typical machine consists of a large number of processing elements (PEs) arranged in a 
square array. Most typical is a 4-connected mesh in which each processor is connected to 
its four nearest neighbors. However, 6 (hexagonal) and 8 connected meshes have also 
been proposed. Most machines built on this topology are SIMD type of machines. Each 
PE has its local memory, and it responds in SIMD mode to the instructions broadcast by 
a controller. The PEs can be selectively masked using mask registers. 

The advantage of this architecture is that images map quite naturally onto its struc- 
ture. When the image size matches the size of the multiprocessor (e.g., NxN mesh for 
NxN image), maximum parallelism can be obtained for those operations that require 
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Figure 1.1 : A 4-Connected Mesh Computer 


computations on individual pixels or a very small neighborhood of pixels. However, this 
type of architecture has several limitations. There are many low and intermediate level 
vision algorithms that involve grouping or matching of image structures which are spa- 
daily distant in an image. But in meshes, communication across large distances is expen- 
sive and inefficient. Therefore, unless the computation is regular and local, meshes do not 
perform well. Furthermore, meshes have been proposed only as SIMD machines, and that 
means lack of MIMD processing capability that is necessary to support high level vision. 
In order to cost-effectively build a multiprocessor with thousands of processors, indivi- 
dual processors must be small, given the technological limitations. Normally, a typical 
machine will have PEs with 1-bit ALUs and a small memory, which may be sufficient for 
small pixel based operations but definitely lacks the power that is needed for intermediate 
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and high level operations. Finally, to most efficiently use a mesh, it is required that the 
data size exactly match the processor size, which is a severe limitation. 

Several mesh-connected multiprocessors have been built. Examples of mesh- 
connected computers include CLIP-4 [5,6,7], GRID [8], GAPP [9], and the MPP 
[10, 11, 12]. Each of these machines has its own special features, but all of them have the 
same general form. One major drawback of these machines has been the inefficiency of 
collecting results and rapid evaluation of the results due to communication bottlenecks. 
This reflects the fact that they were designed and built as stand-alone image processors 
used primarily for image enhancements in which the results of processing are intended 
for interpretation by humans rather than forming the first stage of an autonomous vision 
system. Several enhancements to a mesh have been proposed to alleviate the global com- 
munication problems. Wrapped around connections of the boundary PEs is one of them 
in which top row PEs are connected to the bottom row PEs and the first column PEs are 
connected to the last column PEs. This arrangement is called Torus. This reduces the 
long distance communication time, but the order still remains the same. Other enhance- 
ments include connecting PEs in rows and columns by busses to broadcast common data, 
but these enhancements do not alter the basic structure, advantages and limitations of a 
mesh-connected computer. 

1,2.2. Pyramid computers 

The concept of pyramid computers is essentially an extension of meshes in the third 
dimension. This structure has been proposed in various forms, but the main idea is that an 
image sized mesh-connected array is augmented by layers of successively lower resolu- 
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tion mesh-connected arrays as shown in Figure 1.2. Each array in a pyramid is typically 
one fourth as large as the array below it. Except for the bottom array, each PE in a 
pyramid is connected to four processors in the level below it, in addition to the neighbor 
connections in the same level. Formally, a pyramid consists of (l/2)log N + 1 levels, 
where the i-th level, 0£i£(l/2)log N, is a mesh with N/4‘ PEs. Each level has connections 
to the level above and below, giving each internal PE 9 connections: 4 to its children in 
the level below, 4 to its nearest neighbors at the same level, and 1 to its parent in the 
level above. All the PEs operate in SEMD mode under the directions of a single con- 
troller. Several pyramids have been proposed and built and examples include PAPIA 
[13], SPHINX [14], MPP Pyramid [15], HCL Pyramid [16,17], and others 
[18, 19,20,21]. 

Pyramid multiprocessor architecture provides straightforward implementation of the 
divide-and-conquer based approach. Such pyramids are natural candidates for executing 
divide-and-conquer algorithms, as they most closely mirror the flow of information in 
these algorithms. The pyramid processor provides the capability for quickly changing 
the resolution of an image, which can significantly improve the execution speed of some 
low level algorithms, especially for those that depend upon communication between cells 
that are spatially distant in an image. However, pyramid processors are more difficult to 
build than meshes because of the more complex arrangement for communication links 
and require twice the number of processing elements for the same image resolution. 
Hence, no pyramid multiprocessor has been built commercially. 

Despite the fact that a pyramid machine has multiple levels of processing elements, 
it should not be concluded that a pyramid is suitable for implementing the multiple levels 
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Figure 1.2 : A Pyramid Multiprocessor 

of processing required in an integrated vision system. The pyramid only implements an 
image resolution hierarchy, whereas vision requires an architecture that implements a 
hierarchy of abstraction levels. In a pyramid machine, all the processors are identical and 
execute in SIMD fashion. A vision machine, on the other hand, requires a different type 
of processing at different levels and in a variety of modes of parallelism including both 
SIMD and MIMD. Furthermore, from a purely architectural point of view, utilization of 
pyramid processors, in general, tends to be very low because at each level the slowest 
processor(s) is (are) the botdeneck, and the pipeline of computation (bottom-up) is lim- 
ited by the slowest processor, thereby limiting the utilization at all the levels. Therefore, 
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the pyramid machines, like meshes, are mostly suitable for early low level vision algo- 
rithm only and lack the flexibility for and processing capabilities needed for complex 
vision applications. 

The effectiveness and performance of architectures such as pyramid, array proces- 
sors, and meshes are limited as architectures for integrated vision systems due to several 
reasons. First, they are mostly suitable for SIMD types of algorithms which only consti- 
tute low level vision operations. Second, the architectures are inflexible due to the rigid 
interconnections. Third, the number of processors needed to solve a problem of reason- 
able size is thousands. Such a large number of processors is not only cost prohibitive, 
but the processors themselves cannot be very powerful and can have only limited 
features due to technological limitations. Fourth, it is normally assumed that the problem 
size exactly matches the number of processors available. Most of the time it is not clear 
how to adapt algorithms so that problems of different sizes can be solved on the same 
number of processors. Finally, the problem of input-output of data and fault-tolerance is 
rarely addressed in any of these architectures. It is important to note that no matter how 
fast or powerful a particular architecture is, its utilization can be limited by the 
bandwidth of the I/O. Furthermore, due to rigidity of most architectures, a failure nor- 
mally either results in the failure of the entire system, or the performance degrades 
tremendously. It is important that any architecture for such a complex problem should 
provide for graceful degradation which can be achieved by flexibility of the interconnect 
and capabilities to efficiently reconfigure and partition the architecture. 
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1.2 3 . Hypercube multiprocessors 

Hypercube multiprocessors provide more efficient long distance communication 
that is absent in meshes or pyramids. Machines in this class consist of processors con- 
nected by communication links whose arrangement is topologically equivalent to an n- 
dimensional cube. A hypercube consists of N=2 n PEs for an n dimensional cube. Each PE 
is connected to n other PEs such that their binary representations differ in exactly one bit 
position. Therefore, any PE can communicate with any other PE using at most n com- 
munication links. Figure 1.3 illustrates the organization of a hypercube multiprocessor. 

Several commercially available machines have been built that use the hypercube 
topology. Both SIMD and MIMD types of machines have been built. The Connection 
machine is an SIMD hypercube multiprocessor [22]. However, in a connection machine, 
two communication networks are provided. Each PE is connected to its four NEWS 
neighbors through a NEWS network, and groups of processors are connected in a hyper- 



Figure 1.3 : A Hypercube Multiprocessor 
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cube fashion that provides efficient long distance communication. Such a machine can be 
used for most low level vision algorithms and some intermediate vision algorithms. How- 
ever, like in other SIMD machines, lack of MIMD processing capability precludes its use 
for high level vision. Furthermore, low and intermediate processing cannot occur simul- 
taneously, which is a necessity for complex, real-time vision systems. 

MIMD hypercube multiprocessors are also commercially available. In fact, several 
companies have built MIMD hypercubes of large sizes (up to 1024 processors). Exam- 
ples include Intel Hypercube [23], NCube [24], and Cosmic cube [25]. A typical proces- 
sor node in a machine consists of a general purpose microprocessor (e.g., 80386 and 
coprocessor 80387 in Intel iPSC/2), local memory and routing hardware. Each multipro- 
cessor is controlled by a host processor. The advantage of the hypercubes is that they 
provide efficient long distance communication between processors. Although hypercube 
machines with large dimensionality have been built, current systems are not very 
efficient due to slow communication bandwidths and tremendous overheads of running 
an application. However, a hypercube machine can be used for some intermediate level 
and high level vision applications. One major disadvantage with hypercubes is that in 
order to efficiently utilize the machine, the algorithms running should somehow use the 
underlying topology. Nevertheless, of most existing machines, hypercubes (especially 
MIMD) have proved to be the most cost effective machines for research and develop- 
ment of scientific as well as vision applications and have helped tremendously in learning 
issues in parallel processing in general. But we think that due to rigidity of the architec- 
ture, lack of global control and inefficient communication (especially in large machines) 
will prohibit hypercubes from being multiprocessors for complex vision applications. 
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1.2.4. Shared memory machines 

Shared memory multiprocessors proposed and built are normally MIMD machines. 
Each PE is a general purpose processor with a small local memory. Each PE has access 
to a large global memory through an interconnection structure that connects the PEs and 
the global memory. The design of an interconnection network itself has been a huge area 
of research. Almost all the machines built today have variations of two common inter- 
connection networks: bus-based and multistage interconnection networks. All the inter- 
connections (in the machines built) are a variation of the two approaches. Bus-based sys- 
tems have a limitation on the number of processors, due to the bus access bottlenecks, 
and therefore, are not easily scalable. However, design is relatively simple and cost- 
effective. Machines have been built using up to 32 processors in such a system. Sequent 
Balance [26] and Encore Multimax [27] are two good examples of bus-based, shared 
memory multiprocessors that are commercially available. 

Another class of shared memory multiprocessors use multistage interconnection 
networks for processor-processor or processor-memory interconnections. Some 
bottlenecks of bus based systems are alleviated in such a system; however, the intercon- 
nection networks are complex to build. Scalability in such architectures is much better 
than that in bus-based systems, and machines with up to 128 processors have been built. 
Most machines built or being built have been for research purposes. Examples of these 
machines include BBN Butterfly (Commercially available) [28], IBM RP-3 [29], and 
Cedar, which is being developed at University of Illinois [30]. The main advantage of 
shared memory architecture is the ease of programming and uniform view of the system. 
In other words, control of information and synchronization is much easier compared to 
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that in distributed memory systems. Therefore, this class of machine is best suited for 
high level vision tasks. However, since communication between processor and all the 
interaction between cooperative tasks is done through the global memory there are 
bottlenecks, and hot spots occur. Furthermore, accessing global memory is at least an 
order of magnitude higher than accessing local memories, and therefore, communication 
speed is very slow compared to computation speed. Hence, such machines are efficient 
for only large grain parallelism tasks which have little interactions and exhibit regular 
memory access patterns. Since processes interact with each other using global memory 
shared variables, the comparative overhead of synchronization is very high and also 
results in hot spots. Because the actual image processing operations execute relatively 
quickly when they are divided among multiple processors, the process start-up and syn- 
chronization overhead rapidly grows to dominate the processing time. Therefore, scala- 
bility is definitely a problem in any shared memory multiprocessors. It is possible to 
build big machines, but the return of using larger sized shared memory multiprocessors 
to solve a problem becomes negative beyond a certain size. 

1.2.5. Systolic arrays 

Originally systolic arrays were proposed for special purpose computations. A sys- 
tolic array multiprocessor consists of processors connected in a certain fashion in which 
on each machine cycle each processor takes values from its input ports, performs the 
required computation, and passes the results and data onto its output ports. A systolic 
array can be perceived as a pipeline of a series of processing stations. Once the pipe is 
filled with data, all of the processing stations operate on values in parallel. Systolic array 
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elements can be either general purpose programmable function units or special purpose 
fixed function units. The latter are not useful for vision applications because of their 
inflexibility. The primary advantage provided by a programmable systolic array is high 
performance for low cost. They are, however, best suited for image processing tasks, but 
can work well with any application that involves large arrays of data and regular compu- 
tation. The main disadvantage of a systolic array is that any evaluation of processing 
results must wait until all the data has passed through the array. If a systolic array 
processes an image in one frame time, then this restriction has the effect of allowing the 
controlling process to make the decision and change the array’s programmed functions 
once each frame time. In a systolic array, it is thus much more difficult for a vision sys- 
tem to quickly and flexibly adapt its processing strategy to the actual characteristics of an 
image. 

CMU Warp systolic processor is an example of a programmable systolic array 
designed and built for scientific and image processing applications [31,32,33,34,35]. 
The Warp machine is a systolic array computer of linearly connected cells, each of which 
is a programmable processor capable of performing 10 MFLOPS. Figure 1.4 shows the 
organization of the Warp computer (taken from [35] ). A typical Warp array includes 10 
cells, though it is claimed that it can be extended if more cells are needed [35], The 
Warp array consists of identical cells called Warp cells, as shown in Figure 1.4 Data flow 
through the array on two communication channels (X and Y). Those addresses for cells’ 
local memories and control signals that are generated by the Interface Unit propagate 
down the Adr channel. The direction of Y is statically reconfigurable. For more details 
the reader is referred to [35]. The Warp array can be used for both fine-grain and large- 
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Warp Processor Array 


Figure 1.4 : Warp System Overview 

grain parallelism. It is efficient for fine-grained parallelism because of its high inter-cell 
bandwidth. It is also claimed to be efficient for large-grain parallelism because it is com- 
posed of powerful cells. Each cell is capable of operating independently; it has its own 
program sequencer and program memory. Even though Warp can perform in many 
modes, it is mostly suitable for low and intermediate level vision and does not have the 
desired flexibility, due to its organization, for efficiently performing high level vision 
algorithms. 

1.2.6. Partitionable and hierarchical architectures 

There have been numerous architectures designed and developed for vision that 
cannot be put in any of the classes discussed above. Some of the architectures include 
PM4 [36], PASM [37], REPLICA [38], INSPECTOR [39], and IUA [40]. Design of 
these architectures has addressed the issues of flexibility, partitionability, and 
reconfigurability which are needed in an architecture for an IVS. The following is a brief 
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discussion on some of these architectures, their merits and limitations. An important and 
common characteristic of these architectures is that they are capable of being partitioned 
into one or more independent SIMD and MIMD subsystem. 

PM4 : A Reconfigurable Multiprocessor : 

The PM4 represents one of the first proposals for a reconfigurable multiprocessor 
capable of executing several MIMD and SIMD processes concurrently [36]. It includes a 
large number of processing units constituting a pool. This pool of processing units can be 
partitioned into groups, each one of which can operate independently in either SIMD or 
MIMD mode. Reconfiguration of system resources is dynamic and is primarily software 
controlled. The components of the system include 1) N identical Processor- Memory 
Units (PMUs), 2) K identical Vector Control Units (VCUs), and 3) A three-level 
hierarchical memory connected by a set of interconnection networks and memory 
managements units. 

The PMUs are the basic processing units and they include a microprocessor, a local 
memory and a Local Memory Management Unit (LMMU). The local memory is com- 
posed of interleaved memory modules and serves as a local cache for the microprocessor. 
The VCUs control groups of processors operating in an SIMD mode. The Inter-Processor 
Communication Network (IPCN) implements permutation functions during the execution 
of an SIMD process. Management of the shared memory is accomplished by a Shared 
Memory Management Unit (SMMU), which communicates with the LMMUs and with 
the File Management Control Unit. The Processor-Memory Interconnection Network 
transfers bursts of data or instructions between the shared memory and the PMUs. 
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PASM: 

The PASM, a Partitionable SIMD MIMD Multiprocessor, is also a dynamically 
reconfigurable into one or more independent SIMD and/or MIMD machines [37]. The 
system is composed of Parallel Computation Unit (PCU) which includes N microproces- 
sors, N memory modules and an Interconnection Network (IN) connecting them. There is 
a set of Q Microcontrollers (MC), each controlling NIQ processors. Memory management 
tasks are distributed over a set of processors constituting the Memory Management Sys- 
tem (MMS). The system is to operate under control of a uniprocessor System Control 
Unit or SCU which would be responsible for job scheduling and for coordinating loading 
memory modules within the PCU. Only higher levels of these tasks need to be executed 
on the SCU; the details can be distributed over the MCs and MMS. 

Two processor-memory configurations are being examined for the PCU. In the PE- 
to-PE configuration, each processor has a local memory and the composite processor- 
memory units (PEs) communicate via the IN. When data is to be obtained from the 
memory of another PE, the two PEs involved cooperate to effect the transfer. Two pro- 
cessors are, therefore, involved for any non-local reference. In the P-to-M configuration, 
processors are connected on one side of the Interconnection Network and memory 
modules on the other. The processors do not have a local memory and can access any of 
the modules on the other side of the network. As a result, no explicit data transfers from 
one processor to another are required. However, all references now have to go via the 
Interconnection Network. The system can be partitioned into one or more partitions, each 
with RN/Q processors, where R =2 r . Each partition can operate in either an SIMD or an 


MIMD mode. 
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The code required to be transferred to more than one MCs (for execution on correspond- 
ing sets of N/Q processors) is broadcast to the selected MCs over a bus. 

The REPLICA : 

The REPLICA was designed as a special purpose computer for multi- sensory per- 
ception of 3-D objects [38]. Its main features include support for clean and flexible parti- 
tionability with minimal fragmentation and modularity. The machine consists of the fol- 
lowing components: 1) A pool of N processing elements, each with a local memory, 2) A 
two- level control hierarchy. At level- 1 are the controllers and at level-2 are monitors. 
The monitor layer is responsible for scheduling tasks and reconfiguring and partitioning 
the system. The set of M controllers is uniformly distributed over the system- one for 
each group of NIM processors, 3) A memory management system controlling a large 
shared memory and a secondary memory and 4) Four interconnection networks. One net- 
work handles communication between monitors and controllers. A second one, a capabil- 
ity enhanced crossbar, connects the controller to the PEs. Shift-register rings are used for 
communication between PEs within a partition. Finally, a high bandwidth bus is sug- 
gested for communication between the PEs, controllers, sensors and memory. This bus is 
to support VO and the transfers of data and programs between main memory and the 
local memories of the PEs. 

The clean partitionability is attributed to the capabilities of controller-processor and 
processor-processor interconnection networks. It is claimed that these networks allow 
variable size partitions composed of arbitrary subsets of processors. Partitions can be set 
up rapidly and are totally isolated from each other (i.e., the partitions are clean). 
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The Image Understanding Architecture : 

The Image Understanding Architecture (IUA) integrates parallel processors operat- 
ing simultaneously at three levels of computational granularity in a tightly-coupled archi- 
tecture [40]. Each level of the IUA is a parallel processor that is distinctly different from 
the other two levels, in order to best meet the processing needs for different levels of 
algorithms in a computer vision application. Communication between levels takes place 
via parallel data and control paths. 

The bottom level of the architecture contains an associative processor called the 
Content Addressable Array Parallel Processor (CAAPP). The CAAPP is a 512x512 array 
of 1-bit serial processors designed to operate on arrays of pixels and to construct 
intermediate-level tokens from events in an image. At the intermediate level, an array of 
64x64 16-bit processors, called the Intermediate Communications and Associative Pro- 
cessor (ICAP), are used for the intermediate level of processing. Specifically, the proces- 
sors are used for retrieving, comparing and matching tokens, computing geometric rela- 
tionships between tokens, and constructing new tokens that describe more abstract enti- 
ties. At the top level (called high level) is the Symbolic Processing Array (SPA) which is 
a set of 64 processors capable of executing LISP programs. Their function is to support 
computation involving inference, hypothesis generation and verification, analysis of 
uncertainty, model-based processing and control of processing at the lower levels. 
Currently, a l/64th of the IUA is currently being constructed by the University of Mas- 
sachusetts and Hughes Research Laboratories. 
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1.3. Organization of the Thesis 

This thesis contains 7 chapters. The following is an overview of the contents of each 
chapter and the organization of this thesis. 

Chapter 2 presents a model of computation for IVSs. The model is presented from 
parallel processing perspective. An attempt is made to capture the computation require- 
ments, to recognize data dependencies between tasks, and capture the temporal flow of 
computation. The model is used to develop architectural requirements for multiproces- 
sors for IVSs applications. These requirements broadly describe features that should be 
present in a multiprocessor design in order for it to be efficient for IVSs. 

Architecture of NETRA is presented in Chapter 3. NETRA is a recursively defined 
tree-type hierarchical architecture whose leaf nodes consist of a cluster of processors 
connected with a programmable crossbar with selective broadcast capability to provide 
for desired flexibility. The internal nodes are scheduling processors and their function is 
task scheduling, load balancing, and global memory management. All the scheduling 
processors and the cluster processor are connected to a global memory through a multis- 
tage circuit switched network. The processors in clusters can operate in SIMD, MIMD or 
systolic mode, and therefore, suitable for both low level as well as high level vision algo- 
rithms. A discussion is presented that critically examines the features of NETRA in the 
light of architectural requirements developed in Chapter 2. 

Chapter 4 presents how to map an algorithm on a processor cluster in NETRA in 
various modes such as SIMD and MIMD. Then performance evaluation of algorithms 
when mapped on one cluster is presented. The algorithms are chosen so that they exhibit 
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different communication requirements when mapped in parallel. Performance of some 
algorithms on a simulated cluster is also presented. It is concluded that good speedups 
and performance can be obtained on a cluster because of the availability of a programm- 
able crossbar which provides the necessary flexibility in mapping algorithms with vary- 
ing characteristics. 

Inter-cluster communication is discussed in Chapter 5. A general method of analysis 
of inter-cluster communication is presented. Two alternative inter-cluster communication 
networks, namely, bus and multistage, are evaluated. The analysis of inter-cluster com- 
munication is used to evaluate performance of various algorithms when mapped across 
multiple clusters. When algorithms are mapped onto multiple clusters, the performance is 
affected by conflicts and interference in the global interconnection networks. These 
effects are incorporated in the analysis, and it is concluded that if interconnection 
bandwidth is fast enough then good performance results can be obtained even in the pres- 
ence of high conflicts. 

Chapter 6 presents data decomposition, load balancing and task scheduling tech- 
niques for data dependent algorithms. The techniques exploit the knowledge about the 
data gathered from the current task and use the knowledge about involved computations 
in the next task in order to partition the data onto the available processors so that load 
balancing and high utilization are achieved. In an IVS, in most cases, such information 
can be available because the flow of tasks and their dependencies are known in advance. 
In order to evaluate the performance, implementation results for a few algorithms that are 
part of a motion estimation system are presented when implemented on a hypercube mul- 
tiprocessor system. The reason for choosing a hypercube multiprocessor is that using an 
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existing machine helps capture the overheads associated with such techniques. 


Summary, conclusions and directions for future work are presented in Chapter 7. 



26 


CHAPTER 2. 

MODEL OF COMPUTATION FOR INTEGRATED VISION SYSTEM 


Computer vision transcends a wide range of representations and forms of process- 
ing. Despite advances in many sub-areas of computer vision, there is no consensus on a 
unified approach to vision. However, one can define certain general characteristics of an 
Integrated Vision System (TVS) from computational perspective. For example, it is 
known that a "vision system" must be able to perform diverse sets of complex operations 
on a massive amount of data at high speeds. Motion sequences at moderate resolution 
(512x512 pixels) and typical frame rate (30 ffames/sec) in color (3 bytes) involve more 
than 20 Mbytes of data per second. The amount of computation required for dynamic 
scene interpretation including labeling objects, surface reconstruction and motion 
analysis is difficult to estimate; however, for many applications computational power in 
the range of 10 12-14 instructions per second is required [40]. Not only are the raw pro- 
cessing needs tremendous, but varying the type of processing capabilities (such as 
number crunching, symbol manipulation, and data processing ) are required. 

Parallel processing in some form has been accepted as the approach to providing the 
necessary computational power to solve complex vision problems. But several questions 
remain. What type of parallel processing is best suited? What architectural features are 
needed? How is the performance of a multiprocessor architecture measured and how is 
its effectiveness as an architecture for IVSs evaluated? Several attempts have been made 
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to define benchmarks that capture processing needs for vision tasks [41,42,43]. 
Recently Weems et al. designed a benchmark for integrated vision systems that attempts 
to capture different forms of processing, and includes algorithms with different charac- 
teristics and their interactions [1]. However, the benchmark does not include "time" 
dimension in the sense that motion and time varying information are omitted from the 
benchmark. 

In this chapter, we attempt to define a model of computation for integrated vision 
systems (TVS) from parallel processing perspective. The model also includes the time 
dimension and is more general. It can be used to critically examine a multiprocessor 
architecture proposed for IVSs. However, it is not a benchmark that can be used to evalu- 
ate architectures. Using the model we attempt to identify the architecture requirements 
for IVSs as well as provide a framework to design new benchmarks to evaluate architec- 
tures. 

2.1. Parallelism in IVSs 

Available parallelism in integrated vision systems can be placed in two broad 
categories: namely, Spatial and Temporal Parallelism. Within the categories, the avail- 
able parallelism can be further sub-divided into different classes. The classes depend on 
the type of tasks (or algorithms) constituting the system, the type of architectures on 
which the tasks are to be implemented, the methodology used to implement tasks, 
interactions between the tasks, and control and data flow between the tasks. For exam- 
ple, a task may exhibit suitability for data parallelism at the lowest level and can be 
implemented on a massively parallel SIMD architecture; or a task may exhibit data 
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dependent, non-uniform computation, and therefore, be suitable for implementation on 
an MIMD architecture in a sub-tasks parallelism mode in which sub-tasks cooperate to 
produce results. 

Spatial Parallelism is one in which similar operations are applied in all pans of the 
image data. That is, the data can be divided into many granules and distributed to sub- 
tasks which may execute on different processors in parallel. Most vision algorithms 
exhibit this type of parallelism. In an IVS, each task operates on the output data of the 
previous task in the system. Therefore, the type of data, and data structures may be dif- 
ferent for each task in the system but each form of data can be partitioned into several 
granules to be processed in parallel. For example, consider an IVS that performs object 
recognition. The input image is smoothed using some filtering operation, then on the 
smoothed image an operator is applied for feature extraction, features with similar 
characteristics are grouped, then matching with the models is performed. Each of these 
tasks takes the output of the previous tasks as its input and produces an output which 
becomes the input for the next task. Note that within spatial parallelism, depending on 
the computation involved, an algorithm implementation may be suitable for data parallel- 
ism or task parallelism or both. 

Temporal Parallelism is available when these tasks are repeated on a time sequence 
of images or on different resolutions of images. For example, the system in which motion 
of a moving object is estimated takes a sequence of images of the moving object and per- 
forms the same set of computation on all image frame(s). The processing of each frame 
or a set of frames can be done in parallel with the processing of frames of other time 


instances. 
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Figure 2.1 shows the computational model for IVS which illustrates the above men- 
tioned characteristics of an IVS. Each pipeline shows a number of tasks applied to a set 
of inputs. The input to the first task in a pipeline is the image, and the input to the rest of 
the tasks is the output of the previous task. The set of pipelines illustrates that the entire 
pipeline of tasks is repeated on different images in time and/or resolution. Each block in 
the pipeline represents one task. Each task is decomposed into subtasks to be performed 
in parallel. For example, T\ is one task, and T\(d\) is a subtask of T ! operating on data 
granule d\. The figure shows m tasks in the pipeline. The number of subtasks depends on 
the amount of data in a granule and number of available processors. D il+1 represents data 
transfer from task 7, to task 7 i+1 in the pipeline. The model does not make any assump- 
tions about a particular implementation of a task. 

2.2. Data Dependencies 

Existence of spatial and temporal parallelism may also result in two types of data 
dependencies, namely, spatial data dependency and temporal data dependency. Spatial 
data dependency can be classified into intratask data dependency and intertask data 
dependency. Intratask data dependencies arise when a set of subtasks needs to exchange 
data in order to execute a task in parallel. The exchange of data may be needed during 
the execution of the algorithm, or to combine the partial results, or both. Therefore, each 
task itself is a collection of subtasks which may be represented as a graph with nodes 
representing the subtasks and edges representing communication between subtasks. Inter- 
task data dependency denotes the transfer and reorganization of data to be passed onto 
the next task in the pipeline. The mode of communication may be subtasks of the current 
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Figure 2. 1 : Model of Computation for an Integrated Vision System 


tasks to the subtasks of the next task, or collection and reorganization of the output data 
of the current task and then redistribution of the data for the next task. The choice and 
method depend on the underlying parallel architecture and mapping of algorithms. Tern- 
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poral data dependency is similar to spatial data dependency except that some form of 
output generated by tasks executed on the previous image frames may be needed by one 
or more tasks executing on the current image frames. A simple example of such a depen- 
dency is the IV S of motion estimation in which features from the previous image frames 
are needed in the processing of the current image frames so that features can be matched 
to establish correspondence between features of different time frames. 

The total computation to execute one pipeline includes time to input data, time to 
output data and results, <sum of the times to execute all tasks in the pipeline (which 
includes computation time of subtasks and communication time between subtasks) and, 
data transfer and reorganization time between two successive tasks. Let’s denote t cp as 
computation time for a subtask, f comm as total communication time for a task, t in as data 
input time, t oul as data output time, and t d as data transfer and reorganization time. Then 
time to complete task i, denoted as x, is given by 

=i^AX t C p(7',(dy)) + t C omm(Ti) (2.1) 

Total time to execute one pipeline including the input and output of data is given by 

i«m i«m-l 

^ lot "jLj ti(0| |+i)+ tj>| + toul (2.2) 

i-1 i-1 

Let us now consider some characteristics of the algorithms involved in IVS, and 
using the above model determine desired features and capabilities of a multiprocessor 
architecture suitable for IVS. First, an IVS involves algorithms from all levels of pro- 
cessing, i.e., an IVS normally includes low, medium and high level vision algorithms. 
Typically, the first few tasks of the pipeline are low level algorithms and the last few are 
high level algorithms. The low level algorithms are well understood and well defined. 
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They are normally data independent, have regular structure, and spatial parallelism is 
mostly available at pixel level. They are well suited for both SIMD and M1MD type of 
processing. If communication between processors is fast enough, almost linear speedups 
are possible. Therefore, an architecture for IVS should be capable of efficiently executing 
low level algorithms and algorithms suited for SIMD type of processing. Also, data I/O 
should not be a bottleneck because otherwise, speedups through parallelism can be 
nullified. Examples of low level algorithms include most transforms, filtering algorithms, 
and convolution algorithms. 

High level algorithms are not well understood. They are normally global data 
dependent, involve more complex data structures (compared to pixel representation), and 
need varying communication for parallel processing. These type of algorithms are more 
suited for MIMD type of processing. Hence, the architecture should be capable of execut- 
ing MIMD algorithms efficiently. 

2.3. Features and Capabilities of Parallel Architectures for IVSs 

The following are the architecture requirements for a multiprocessor architecture to 
be suitable for integrated vision systems. First, the ability to transform image data (pixel 
data) into a set of meaningful symbols that describe it. Second, the ability to process 
pixel and symbol data in parallel as well as concurrently. Third, the ability to simultane- 
ously maintain low, intermediate and high level representations, and the ability to per- 
form low, intermediate and high level algorithms simultaneously on inter-related or 
independent data. Fourth, fast I/O and processing rates for huge amounts of data at all 
levels of computations. Fifth, the ability to select particular subsets of data for varying 
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types of processing. Finally, the ability to perform top-down as well as bottom-up pro- 
cessing efficiently, and the ability to report the results efficiendy. These are some of the 
broad requirements for an architecture for integrated vision systems. From the above dis- 
cussion we can transform the requirements into specific architecture requirements as 
presented below. 

(1) Reconfigurability: From the model and the preceding discussion it is clear that mul- 
tiple levels of representations and stages of processing are essential and require very 
different types of processing. Hence, the architecture should be capable of executing 
both SIMD and MIMD type computations efficiently. That is, it should be possible 
to reconfigure the architecture such that each algorithm can be implemented 
efficiendy using the most suited mode of computation. 

(2) Flexible Communication: Fine grained and high speed communication is required 
both among the processes at each level and between the different processing levels. 
The communication requirements vary for different algorithms. The communication 
pattern between processors executing subtasks of a larger task depends on the algo- 
rithm involved in the task. If the connectivity between processors is too rigid then 
the communication overhead of intratask and intertask communication may become 
prohibitive. Therefore, it is desirable that the communication be flexible in order to 
provide the most efficient communication with low overhead. 

(3) Resource Allocation and Partitionability: As we discussed earlier, there are several 
tasks with vastly different characteristics and computational requirements in an IVS. 
These tasks need to exist simultaneously in the system. Therefore, the system 
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should be partitionable into many independently controlled subsystems to execute 
each task. Since the high level algorithms exhibit varying level of parallelism and 
data dependent performance, it should be possible to allocate resources (such as 
processors, memory) dynamically to meet the performance requirements. 

(4) Load Balancing and Task Scheduling: Load balancing and task scheduling are very 
important, especially for high level jdsion algorithms, which are data dependent, 
and therefore, in order to obtain better utilization of resources and better speedups, 
dividing the computation equally among the processor is critical [44], The underly- 
ing architecture on which load balancing is done and the type of algorithm(s) 
involved contribute significantly to how well load balancing can be achieved. In low 
level algorithms since the computations are data independent, partitioning data 
equally among the processors normally balances the load among them. However, 
for high level algorithms, more sophisticated load balancing and scheduling stra- 
tegies are needed. The architecture should include features such that it is easy to 
perform load balancing and task scheduling and that the overhead of doing so is 
minimal . 

(5) Topology and Data Size Independent Mapping: For a system as complex as an IYS, 
if the underlying architecture and its interconnect is rigid such that the problem size 
that can be solved on it or how it can be mapped is tied to the interconnection, the 
effectiveness of the architecture will diminish as an architecture for an IVS. 

(6) Fault-Tolerance: Fault-tolerance is an important part of a system of such complex- 
ity. A failure in a processor or communication structure should not affect the per- 
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formance drastically, which is normally the case when rigid interconnections are 
present between processors. The architecture should provide for graceful degrada- 
tion in case of failures. 

(7) Input-Output: It is most often the case that an architecture is able to perform very 
well on some algorithms, and high speedups are obtained, but input-output (I/O) of 
data is inefficient. I/O is an integral part of a system and if it is a bottleneck then 
performance of the system will be limited. 

2.4. Examples of Integrated Vision Systems 

2.4.1. Image understanding benchmark 

Recently, a DARPA sponsored effort has been directed towards developing bench- 
marks to evaluate architectures for integrated vision systems, and the benchmarks and 
rationale behind it appeared in [1]. We will briefly present a discussion on the bench- 
mark as it partly represents an integrated vision systems. The following are some of its 
features. The benchmark involves a simple image domain with well-defined, well- 
behaved objects. It requires both bottom-up (data-directed) and top-down (knowledge or 
model-directed) processing. The top-down processing can involve processing of low and 
intermediate level data to extract additional features from the data, or can involve control 
of low and intermediate level processes to reduce the total amount of computation 
required. It tests low level operations such as convolution, thresholding, connected com- 
ponent labeling, edge tracking, median filter, hough transform, convex hull, and comer 
detection. It requires utilization of information from two sensors in order to complete the 
interpretation process. It tests grouping operations and graph matching, as representative 
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examples of intermediate level and high level processing, respectively. It requires use of 
both integer and floating point representations. Finally, it tests the communication chan- 
nels between symbolic and numeric levels of processing. The reader is referred to [1] for 
more details. 

The benchmark includes most characteristics of a typical integrated vision system, 
or at least is a good representation of what type of processing may be needed in such a 
system. However, it does not span the entire spectrum. Most important, the benchmark 
does not include motion information. That is, it does not capture the real-time image 
input or time-varying information processing. Our model, presented earlier in this 
chapter, tries to capture the time varying characteristics of an integrated system and cap- 
tures most characteristics of the integrated benchmark presented in [1]. Furthermore, it 
provides a framework to develop benchmarks in the future even though several 
refinements in the model need to be performed and a more detailed view has to be pro- 
vided. 

2.4.2. Motion estimation and object recognition 

In this system, sequence of images of a scene containing moving object(s) is used to 
compute the motion of the object(s) in the scene, and using the motion parameters and 
features from the images, object recognition is performed [45]. The computation 
involves extracting zero crossings (convolution, template matching and thresholding), 
stereo matching (graph matching and grouping ) of features, hough transform, and model 
directed object recognition in which features obtained from the image data are correlated 
with the features of the set of model objects in order to obtain the best match. Therefore, 



37 


this system also involves both bottom-up and top-down processing. Figure 2.2 shows the 
computational flow for the motion estimation system in which stereo images ( Limjtim ) at 
each time frame are used as the input to the system. Briefly, the involved tasks in this 
system are as follows. The first algorithm in the system is computation of zero crossings 
in the images (edge detection ( Lzc and Rzc)). The zero crossings are used to perform 
stereo match between the two images of the same time frame. The stereo match algo- 
rithm provides points to compute 3-D information about the object in the scene. Using 
these matched points ( Lsm and Rsm), the corresponding points in the image in the next 
time frame ( Lem ) are located, and this task is performed by time match algorithm. Again, 
stereo match is used to obtain the corresponding 3-D points in the next image frame. 
These two sets of points provide information to compute the motion parameters. Using 



RM: Recognition Module 


Figure 2.2 : Computation Flow for Motion Estimation 
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the motion parameters and the information from the models, object recognition module 
performs the task of picking the best match between models and the information from 
image data. The above process is repeated for each new set of input image frames. 

Hence, such a system, in addition to exhibiting many of the properties included in 
the image understanding benchmark, also captures motion information. Therefore, 
several architecture features pertaining to real-time processing, and fast I/O processing 
can be evaluated using such a system. 
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CHAPTER 3. 

ARCHITECTURE OF NETRA 


This chapter contains a detailed description of the architecture of NETRA. The first 
four sections describe the components of NETRA, their functions, capabilities and 
features. The last section critically examines the architecture in view of the computa- 
tional requirements for IVS developed in the previous chapter. 

Figure 3.1 shows the architecture of "NETRA," which is a recursively defined 
hierarchical multiprocessor system and provides distributed as well as shared memory 
environment. The architecture consists of the following components : 

(1) A large number (1000 - 10000) of Processing Elements (PEs), organized into clus- 
ters of 16 to 64 PEs each. 

(2) A tree of Distributing-and-Scheduling-Processors (DSPs) that make up the task dis- 
tribution and control structure of the multiprocessor. 

(3) A parallel pipelined shared Global Memory and a Global Interconnection that links 
the PEs and DSPs to the Global Memory. 

3.1. Processor Clusters 

The clusters consist of 16 to 64 PEs, each with its own program and data memory. 
Each PE is a general purpose processor with a high speed floating point capability. They 
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DSP : Distributing and Scheduling Processor 
C : Processor Ouster M : Memory Module 

Figure 3.1 : Organization of NETRA 

form a layer below the DSP-tree, with a leaf DSP associated with each cluster. PEs 
within a cluster also share a common data memory. The PEs, the DSP associated with the 
cluster, and the shared memory are connected together with a crossbar switch. The 
crossbar switch permits point-to-point communications as well as selective broadcast by 
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the DSP or any of the PEs. Figure 3.2 shows the cluster organization. A 4x4 crossbar is 
shown as an example of the implementation of the crossbar switch. The crossbar design 
consists of pass transistors connecting the input and output data lines. The switches are 
controlled by control bits indicating the connection pattern. If a processor or DSP needs 
to broadcast, then all the control bits in its row are made one. In order to connect proces- 
sor Pi to processor P jy control bit (i j) is set to one, and the rest of the control bits in row i 
and column j are off. 

Clusters can operate in SIMD mode, systolic mode, or MIMD mode. In an SIMD 
mode, PEs in a cluster execute identical instruction streams from private memories in a 
lock-step fashion. In systolic mode, PEs repetitively execute an instruction or set of 
instructions on data streams from one or more PEs. In both cases, communication 
between PEs is synchronous. In MIMD mode, PEs asynchronously execute instruction 
streams resident in their private memories. The streams may not be identical. In order to 
synchronize the processors in a cluster, a synchronization bus is provided which is used 
by processors to indicate to the DSP that a processor(s) has finished its computation or a 
processor wants to change the communication pattern. The DSP can either poll the pro- 
cessors or the processors can interrupt the DSP using the synchronization bus. 

3.1.1. Crossbar design 

There is no arbitration in the crossbar switch. That is, the interconnection between 
processors has to be programmed before processors can communicate with each other. 
Programming a crossbar requires writing a communication pattern into the control 
memory of the crossbar. A processor can alter the communication pattern by updating 
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Figure 3.2 : Organization of Processor Cluster 


the control memory as long as it does not conflict with the existing communication pat- 
tern. The DSP associated with a cluster can write into the control memory to alter the 
communication pattern. The most common communication patterns, such as linear 
arrays, trees, meshes, pyramids, shuffle-exchanges, cubes, broadcast, can be stored in the 
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memory of the crossbar. These patterns need not be supplied externally. Therefore, 
switching to a different pattern in the crossbar can be fast because switching only 
requires writing the patterns into the control bits of the crossbar switches from its control 
memory. 

The advantages of such a crossbar design are the following: First, since there is no 
arbitration, the crossbar is relatively faster than one which provides arbitration because 
switching and arbitration delays are avoided. Second, it is easier to design and implement 
the crossbar because arbitration is absent, and therefore, switches are simple. Further- 
more, it is possible to implement systolic algorithms using the crossbar because it can 
transfer data at the same or greater speed than required by the systolic computation. Such 
a crossbar is easily scalable. Unlike other interconnections, such as cubes and shuffle- 
exchanges, the scalability need not be in power of 2. A unit scalability is possible. Furth- 
ermore, for the same reason, it is easy to provide fault-tolerance because one spare pro- 
cessor can replace any failed processor, and one extra crossbar link can replace any failed 
link. This is possible because there is no inherent structure that connects the processor 
and each processor, (link) is topologically equivalent to any other processor Gink). 

3.1.2. Scalability of crossbar 

Figure 3.3a) depicts a 1 bit 4x4 crossbar switch. In order to obtain byte or word 
parallel crossbar, the crossbar switches can be stacked together as shown in Figure 3.3b). 
The control, address and communication pattern information is exactly the same in all 
the stacked switches. Figures 3.3c), d) and e) illustrate the size scalability. Figure 3.3c) 
shows how a 4x8 crossbar can be obtained from two 4x4 crossbars. Similarly, Figures 
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3.3d) and e) illustrate how 8x4 and 8x8 crossbars can be obtained, respectively. Note that 
the smallest switch need not be a bit crossbar. Depending on the technology and availa- 
bility of the I/O pins, it can be of any size (such as 4 bit or a byte). Furthermore, depend- 
ing on the available pins, it can be a 16x16 or 32x32 bit crossbar. Finally, sizes of the 
crossbar need not be a multiple of two but can be any arbitrary. 







e) 8X8 


Figure 3.3 : Scalability of Crossbar 
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3.2. The DSP Hierarchy 

The DSP-tree is an n-ary tree with nodes corresponding to DSPs and edges to bi- 
directional communication links. Each DSP node is composed of a processor, a buffer 
memory, and a corresponding controller. 

The tree structure has two primary functions. First, it represents the control hierar- 
chy for the multiprocessor. A DSP serves as a controller for the subtree structure under it. 
Each task starts at a node on an appropriate level in the tree, and is recursively distributed 
at each level of the subtree under the node. At the bottom of the tree, the subtasks are 
executed on a processor cluster in the desired mode (SIMD or MIMD) and under the 
supervision of the leaf DSP. 

The second function is that of distributing the programs to leaf DSPs and the PEs. 
Vision algorithms are characterized by a large number of identical parallel processes that 
exploit the spatial parallelism and operate on different data sets. It would be highly 
wasteful if each PE issued a separate request for its copy of the program block to the 
global memory because it would result in unnecessary traffic through the interconnection 
network. Under the DSP-hierarchy approach, one copy of the program is fetched by the 
controlling DSP (the DSP at the root of the task subtree) and then broadcast down the 
subtree to the selected PEs. Also, DSP hierarchy provides communication paths between 
clusters to transfer control information or data from one cluster to others. Finally, the 
DSP-tree is responsible for Global Memory management 
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3.3. Global Memory 

The multiport global memory is a parallel-pipelined structure as introduced in [46]. 
Given a memory(chip)-access-time of T processor-cycles, each line has T memory 
modules. It accepts a request in each cycle and responds after a delay of T cycles. Since 
an L-port memory has L lines, the memory can support a bandwidth of L words per cycle. 

Data and programs are organized in memory in blocks. Blocks correspond to "units” 
of data and programs. The size of a block is variable and is determined by the underlying 
tasks and their data structures and data requirements. A large number of blocks may 
together constitute an entire program or an entire image. Memory requests are made for 
blocks. The PEs and DSPs are connected to the global memory with a global intercon- 
nection network. 

The global memory is capable of queuing requests made for blocks that have not yet 
been written into. Each line (or port) has a Memory-line Controller (MLC) which main- 
tains a list of read requests to the line and services them when the block arrives. It main- 
tains a table of tokens corresponding to blocks on the line, together with their length, vir- 
tual address and full/empty status. The MLC is also responsible for virtual memory 
management functions. 

Two main functions of the global memory are input-output of data and program to 
and from the DSPs and processor clusters, and to provide inter-cluster communication 
between various tasks as well as within a task if a task is mapped onto more than one 


cluster. 
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3.4. Global Interconnection 

Currently, two alternative global interconnection schemes are being evaluated. First 
is a high speed bus which is connected to one port from each cluster and to the global 
memory. The second is a multistage interconnection network connecting the global 
memory and cluster processors. 

3.4.1. Interconnection network 

The PEs and the DSPs are connected to the Global Memory using a multistage 
circuit-switching interconnection network. Data is transferred through the network in 
pages. A page is transferred from the global memory to the processors which is given in 
the header as a destination port address and the header also contains the starting address 
of the page in the global memory. When the data is written into the global memory, only 
the starting address needs to be stated. In each case, end-of-page may be indicated using 
an extra flag bit appended to each word. 

3.4.2. Global bus 

We are evaluating an alternative strategy to connect DSPs, clusters and the global 
memory using a high speed bus. In this organization one port of each cluster will be con- 
nected to the high speed bus. Also, each DSP will be connected to the bus. Processors 
that need to communicate with processors in other clusters use explicit messages to send 
and receive data from the other processors. Figure 3.4 illustrates this method. A proces- 
sor Pi in cluster C, can send data to a processor Pj in cluster C ; as shown in the Figure. P, 
sends the data to the DSP t , which sends the data to DSPj in a burst mode. DSPj then sends 
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the data to the processor P r We are evaluating both alternatives for intercluster commun- 
ication. 

3.5. IVS Computation Requirements and NETRA 

In the following discussion we examine the architecture in the light of requirements 
for an IVS discussed in Chapter 2. 

Reconfigurability (Computation Modes) 

The clusters in NETRA provide SIMD, MIMD and systolic capabilities. As we dis- 
cussed earlier, it is desirable to have these modes of operations in a multiprocessor sys- 
tem for IVS so that all levels of algorithms can be executed efficiently. For example, 
consider matrix multiplication operation. We will show how it can be performed in 


GLOBAL BUS 



Figure 3.4 : An Alternative Strategy for Inter-Cluster Communication 
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SIMD and systolic modes. Assume that the computation requires obtaining matrix 
C =AxB. For simplicity, also assume that the cluster size is P and the matrix dimensions 
are PxP. Note that this assumption is made to simplify the example description. In gen- 
eral, any arbitrary size computation can be performed independent of the data or cluster 
size. 

SIMD Mode 

The algorithm can be mapped as follows. Each processor is assigned a column of 
the B matrix, i.e., processor P t is assigned column fl;. Then the DSP broadcasts each row 
to the cluster processor, which computes the inner products of the rows with their 
corresponding columns in lock-step fashion. Note that the elements of the A matrix can 
be continuously broadcast by DSP, row by row without any interruptions, and therefore, 
efficient pipelining of data input, multiply, and accumulate operations can be achieved. 
Figure 3.5a) illustrates a SIMD configuration of a cluster. The following pseudo code 
describes the DSP and processor (P k 's program, 0<£<P-1) program. 
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SIMD Computation 
DSP P k 


1 . 

FOR i=0 to i=P-l DO 

1 . 

2. 

connect ( DSP^Pi ) 

2 . - 

3. 

out( column BO 

3. in(columnBi) 

4. 

ENDFOR 

4. - 

5. 

connect(DSP, all) 

5. - 

6. 

FOR i=0 to i=P-l DO 

6. =0 

7. 

FOR j=0 to j=P-l DO 

7. FOR j=0 to j=P-l DO 

8. 

out(aij) 

8. in(dij) 

9. 

END FOR 

9. c ik = c ik + a i j*bj k 

10. 

ENDFOR 

10. END FOR 


In the above code, the computation proceeds as follows. In the first three lines, the 
DSP connects with each processor through the crossbar and writes the column on the out- 
put port. That column is input by the corresponding processor. In statement 5, the DSP 
connects with all the processors in a broadcast mode. Then, from statement 6 onward, the 
DSP broadcasts the data from matrix A in row major order, and each processor computes 
the inner product with each row. Finally, each processor has a column of the output 
matrix. It should be mentioned that the above code describes the operation in principle 
and does not exaedy depict the timing of operations. 
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Systolic Computation 


DSP Pi 


1. FOR i=0 to i=P-l DO 

2 . connect( DSP ,P t ) 

3. out( column Bi) 

4. out(rowAi) 

5. ENDFOR 

6. connect(Pi to P i+ \ mod P) 

7. 

8 . 

9. 

10 . 

11 . 


1. 

2 . - 

3. in(columnBi) 

4. inf column Ai) 

5. - 

6 . c,i=0 

7. FOR j=0 to j=P-l DO 

8. cu — Ca + dij*bji 

9. outfaij), infdi-y) 

10. END_FOR 

11. repeat 7-10 for each new row 


Systolic Mode 

The same computation can be performed in a systolic mode. The DSP can 
reconfigure the cluster in a circular linear array after distributing columns of matrix B to 
processors as before. Then DSP assigns row A t of matrix A to processor P,. Each proces- 
sor computes the inner product of its row with its column and at the same time writes the 
element of the row on the outout port. This element of the row is input to the next proces- 
sor. Therefore, each processor receives the rows of matrix A in a systolic fashion, and 
the computation is performed in the systolic fashion. Note that the computation and com- 
munication can be efficiently pipelined. In the code, it is depicted by statements 7-10. 
Each element of the row is used by a processor and immediately written onto the output 
port, and at the same time, the processor receives an element of the row of the previous 
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processor. Therefore, every P cycles a processor computes new element of the C matrix 
from the new rows it receives every P cycles. Again, note that the code describes only the 
logic of the computation and does not include the timing information. Figure 3.5b) illus- 
trates a systolic configuration of a cluster. 

Partitioning and Resource Allocation 

There are several tasks with vastly different characteristics in an IVS, and therefore, 
the number of processors needed for each task may be different and may be needed in 
different computational modes. Hence, partitionability and dynamic resource allocations 
are keys to high performance. Much effort has been devoted towards investigating the 
partitionability of interconnection networks [47,48]. Approaches such as in [37,48] are, 
however, required only when processes are tightly coupled. In the above case physical 



a ) SIMD Mode 



b) Systolic Mode 

Figure 3.5 : An Example of SIMD and Systolic Modes of Computation in a Cluster 
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partitions are established. In other words, links are reserved for specific point-to-point 
communication while a process executes. Whenever a new process is to be instantiated, 
required resources should be free and linked together in a specified manner. The parti- 
tioning is, in effect, isolated from the rest of the system. 

Partitioning in NETRA is achieved as follows. When a task is to be allocated, the 
set of subtrees of DSPs is identified such that the required number of PEs is available at 
their leaves. One of the subtrees is chosen on the basis of characteristics of the task, 
locality constraints and load balancing considerations. The chosen DSP represents the 
root of the control hierarchy for the task. Together with the DSPs in its subtree, it 
manages the execution of the task. Note that partitioning is only virtual. The PEs are not 
required to be physically isolated from the rest of the system. Once the subtree is chosen, 
the processes may execute in SIMD, MIMD or systolic mode. The following are some of 
the advantages of such a scheme. First, only one copy of the programs needs to be 
fetched, thereby reducing the traffic through the global interconnection network. Second, 
simple load balancing techniques may be employed while allocating tasks. The tasks of 
global memory management can be distributed over the DSP tree by assigning it to the 
DSP at the root of the subtree executing the subtask. Finally, locality is maintained 
within the control hierarchy, which limits the intratask communication to within the sub- 
tree. 

Load Balancing and Task Scheduling 

Two levels of load balancing need to be employed, namely, global load balancing 
and local load balancing. Global load balancing aids in partitioning and allocating the 
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resources for tasks as discussed earlier. Local load balancing is used to distribute compu- 
tations (data) to processors executing subtasks of a larger task. Local load balancing can 
be either static or dynamic or a combination of both. With static load balancing, given a 
task, its associated data and the number of processors allocated for the task, the data is 
partitioned in such a way that each processor gets an equal or a comparable amount of 
computation [44], In dynamic load balancing, the subtasks are dynamically assigned to 
the processors as and when they finish the previously assigned tasks. In NETRA, when a 
task is assigned to a subtree, the DSPs involved perform the local load balancing func- 
tions. 

Using the information from local load balancing and other measures of computa- 
tion, global load balancing can be achieved hierarchically by using the DSP hierarchy. In 
this scheme, each controller DSP sends its measure of load to its parent DSP and the root 
DSP receives the load information for the entire system. The root DSP then broadcasts 
the measure of load of the entire system to the DPSs. When a task is to be allocated, 
these measures can be used to select a subtree for its execution as follows: If any subtree 
corresponding to the child of the current DSP has an adequate number of processors then 
the task is transferred to a child DSP with the lowest load; else if the current subtree has 
enough resources and the load is not significantly greater than the average system load 
then the task is allocated to the current subtree; else the current DSP transfers the task to 
the parent DSP. 
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Flexible Communication 

Availability of flexible communication is critical to achieving high performance. 
For example, when a partition operates in SIMD mode there is a need to broadcast the 
programs. When a partition operates in MIMD mode, where processors in the partition 
cooperate in the execution of a task, one or more programs need to be transferred to the 
local memories of the processors. Performing the above justifies a need for selective 
broadcast capability. In order to take advantage of spatial parallelism in vision tasks, pro- 
cessors working on neighboring data need to communicate fast among themselves for 
high performance. The programmability and flexibility of the crossbar provide fast local 
communication. Most common vision algorithms, such as FFTs, filtering, convolution, 
counting, and transforms need a broad range of processor connectivities for efficient exe- 
cution. These connectivities include arrays, pipelines, several systolic configurations, 
shuffle-exchanges, cubes, meshes, and pyramids. Each of these connectivities may per- 
form well for some tasks and badly for others. Therefore, using a crossbar with a selec- 
tive broadcast capability, any of the above configurations can be achieved, and conse- 
quently, optimal performance can be achieved at the clusters. 

Several techniques for implementing reconfigurability between a set of PEs were 
studied [47,49]. It was discovered that using a crossbar switch to connect all PEs was 
simpler than any other schemes. The popular argument that crossbar switches are expen- 
sive was easily thwarted. When designing communication networks in VLSI, the primary 
constraint is the number of pins and not the chip area. The number of pins is governed by 
the number of ports on the network and is independent of the type of network. Further- 
more, it was realized that a crossbar with a selective broadcast capability was not only a 
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very powerful and flexible structure, but was also simpler, scalable, and less expensive. 

The need for global communication is relatively low and infrequent. Global com- 
munication is needed for intertask communication, i.e., from one task to another in the 
IYS pipeline. It is also needed to input and output data, to transfer data within a subsys- 
tem when a task is executed on more than one cluster, and finally, it is needed to load the 
programs. The most important issue in global communication is that the network speed 
should be matched with the crossbar speed as well as with the processors speed. The glo- 
bal communication is performed through the global memory using the interconnection 
network, or using the DSP hierarchy. Another alternative we consider is connecting all 
the clusters and DSPs to a global bus. Since the DSPs perform most control functions and 
loading of programs and data, the responsibility of intertask communication does not lie 
with the DSP hierarchy. 

I/O and Global Memory Access 

The global memory is equally accessible from all the processors and DSPs in the 
system. Input-Output of data from (to) sensors and other I/O devices is performed 
through the global memory. Since the global memory lies in the address space of each 
processor and provides a uniform view across the architecture, I/O is uniformly distri- 
buted. Therefore, there are no I/O bottlenecks in the system. Furthermore, the global 
memory provides a uniform access to the shared database, which may contain models 
and other system data. 

A large system such as NETRA implies a large memory and a large interconnection 
network. Therefore, the response times to memory requests can be large and variable in a 
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nondeterministic manner, due to conflicts in accessing global memory and interconnec- 
tion network. Hence, there is a need for the PEs to be able to issue multiple requests in 
advance and accept out-of-order requests. 

NETRA is a multiprogrammed system with a large number of processes active at 
any time. A process becomes active when a token corresponding to the process is entered 
into the Active Queue of a PE (for an MIMD process) or a cluster (for a SIMD process). 
Data requests for the required input data blocks are immediately issued. When all input 
data blocks for a process are available, it is transferred to the Ready Queue. However, 
while these requests are serviced, the PEs continue to execute process already in their 
Ready Queue. Access to memory for one process is thus overlapped with execution of 
another. Multitasking at the PE level, therefore, permits each PE to tolerate large and 
undeterministic memory access latencies. Since the assumption is that such a system will 
be executing an integrated vision system, as we observed in Chapter 2, there will be 
enough processes available and active all the time. Furthermore, the future tasks will be 
somewhat predictable because order of the tasks is known from the model of computa- 
tion and control flow of the system. 

3.6. Comparison of NETRA with Other Architectures 

The following discussion presents a comparison of NETRA with other architecture 
proposals. Some parts of the discussion in this section have been presented in [3]. 

Partitionability 

PM4 and PASM support partitions that contain one or more of a group of proces- 
sors. Each group has a fixed size. These systems are, therefore, likely to suffer from 
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"internal fragmentation," that is, when the number of PEs to be used is not a multiple of 
the number present within each group, some processors within one or more groups will 
remain idle, and processor utilization will suffer. REPLICA supports partitions of any 
size by allowing each group to execute more than one SIMD and MIMD processes. This 
capability is by virtue of a capability-enhanced crossbar switch used to connect the PEs. 
NETRA, too, provides a similar crossbar at the cluster level and is, hence able to support 
independent partitions within the cluster. This features eliminates internal fragmentation 
and improves processor utilization. 

Setup of Partitions 

The second aspect of partitionability is that while NETRA establishes virtual parti- 
tions, physical partitions are established in the other cases. NETRA does not partition the 
pool of processors into isolated subsystems. Instead, it merely allocates processes to sub- 
sets of processors. The process of partitioning involves only placing the token for a pro- 
cess in the appropriate queue. In the case of SIMD process, the tokens are placed on the 
active queues of DSPs controlling the selected clusters. For MIMD processes, the tokens 
are placed in the active queues of the selected PEs. SIMD processes are initiated by the 
controlling DSP. MIMD processes are, however, started when the token reaches the head 
of the ready queue within the PE. If the MIMD processes need to synchronize in any 
way, they must be started and synchronized by the cluster-DSP. 

Multiprogramming 

NETRA is a multiprogrammed system and is most efficient if a large number of 
tasks are active at all times. Multitasking at the PE level has been used as a tool to 



59 


decouple system performance from large memory access latencies. As explained earlier, 
long delays for memory access are inherent in a large system, and the system has to be 
able to tolerate such delays without loss of performance. 

An intelligent memory system has been employed to support multitasking. This 
feature essentially permits the PEs to execute queued tasks without having to associate 
with the data-fetching operation. The above scheme is unique to NETRA, and its 
absence in any other proposals makes them far less suitable for large systems. 

Virtual Versus Physical Partitions 

In the case of PASM, PM4 and REPLICA, communication links are dedicated to 
partitions; consequently, all subsystems are isolated from each other. MIMD processes, 
especially executing intermediate and high level algorithms, exhibit widely varying exe- 
cution times since the amount of processing is data dependent Therefore, when rigid par- 
titions are used, processors would have to wait until all complete processing before they 
start executing another process. If the deviation in the processing times is great, such a 
waiting will reduce the processor utilization tremendously. 

A second factor in favor of virtual partitioning is that it allows for easy allocation of 
tasks that are dynamically created. An arbitrary number of such processes may be gen- 
erated, and their scheduling would be a difficult problem in all of the above proposals but 
NETRA. On NETRA these processes can be allocated as they are generated on the basis 
of load balancing and locality considerations alone. How the other systems would handle 
dynamically created tasks is not clear, but it would certainly require several global con- 
siderations. Since tasks would already have been scheduled to execute on specific 
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partitions, where the dynamically created processes would fit in would have to be deter- 
mined. The network would be able to support only selected partitions, and scheduling on 
the fly would be difficult. 

Scheduling and Load Balancing 

Fragmentation can be minimized in PM4, PASM and REPLICA only if scheduling 
is static or done considerably in advance of execution. This is because scheduling would 
involve global considerations such as partitionability of the interconnection network and 
availability of resources. A closer look, however, reveals a major difficulty. 

For efficient scheduling and preloading, the scheduler should know in advance what 
resources will be available at a given time. In other words, it should know when each 
process is to end. Clearly, this requires a considerable amount of determinism in the 
behavior of a process. While such information is easily available for SIMD tasks, it may 
be impossible to obtain it for most MIMD tasks. 

An essential implication is that processes cannot always be prescheduled. This in 
turns implies that data and programs for these processes cannot be prefetched. Conse- 
quently, processor utilization suffers drastically. 

The scheduling scheme on NETRA is clearly simple and superior. The scheduling 
process (executing on a DSP) is not confronted with a large volume of information. It 
needs to consider only the average load on the subtree below it and the overall average 
load of the system. As soon as a process is placed in an active queue, a request for the 
input data blocks required is issued. Prescheduling and data prefetching are thus easily 
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accomplished. A hierarchical control-structure (the DSP-tree) and simple scheduling and 
load balancing heuristic are, therefore, able to provide for high performance. 
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CHAPTER 4. 

PARALLEL ALGORITHMS ON A CLUSTER 


There are two main considerations in the mapping of parallel algorithms onto a 
cluster. First is selection of a computation mode such as SIMD, MIMD or systolic, and 
the second is the number of available processors on a cluster and selection of the best 
way to map the algorithm. For a data dependent algorithm, there may be need for nonuni- 
form data partitioning and local load balancing. The load balancing scheme may be static 
or dynamic. In the static scheme, the DSP in a cluster allocates tasks to the processors 
using some a priori knowledge about the computation such that each processor receives 
an average amount of computation. Under the dynamic load balancing scheme the DSP 
maintains a queue of ready tasks and assigns the tasks to the available processors as they 
become free to execute the next task. 

The methodology we use for mapping parallel algorithms is multidimensional, 
divide-and-conquer with medium to large grain parallelism. An individual task (in the 
following discussion task and algorithm are used interchangeably) can be efficiently 
mapped using spatial parallelism, because most of the vision algorithms are performed 
on two dimensional data. However, integration of tasks involves exploiting both spatial 
as well as temporal parallelism can be exploited by recognizing intertask data dependen- 


cies. 
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The purpose of this chapter is to evaluate performance of several common vision 
algorithms when mapped onto a processor cluster. The discussion identifies the computa- 
tion modes suitable for an algorithm, and suggests alternatives to map an algorithm. 
Furthermore, performance evaluation of each algorithm is presented using accurate 
analysis, and the analytical results of some of the algorithms are compared with the 
implementation results. It is shown that analytical results are very close to the implemen- 
tation results. The analysis provides the flexibility to vary several parameters, and there- 
fore, it is easier to study the effects of alternative approaches. 

This chapter is organized as follows. Section 4.1 presents a classification of some 
common vision algorithms based on their computation and communication requirements. 
Section 4.2 briefly outlines alternative mapping strategies on a processor cluster. Section 
4.3 contains mappings and analytical performance results for one algorithm from each 
class and discusses alternative mappings for some algorithms. Parallel implementation of 
some of the algorithms is presented in Section 4.4 and the results are compared with 
analytical results. Performance of two algorithms from Image Understanding Benchmark 
developed by Weems et al. [1] is also presented. 

4.1. Classification of Common Vision Algorithms 

We can classify some of the common vision algorithms according to their commun- 
ication requirements when mapped onto parallel processors. The classification provides 
an insight into the performance of an algorithm depending on its communication require- 


ments. 
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(1) Local Fixed - In these algorithms, the output depends on a small neighborhood of 
input data in which the neighborhood size is normally fixed. Sobel edge detection, 
image scaling, and thresholding are examples of such algorithms. Figure 4.1a) illus- 
trates that the output at point (x,y) depends on fixed size neighborhood values. 

(2) Local Varying - Like the local fixed algorithms, the output at each point depends on 
a small neighborhood of input data. However, the neighborhood size is an input 
parameter and is independent of the input image size. Convolutions, edge detection 
and most other filtering and smoothing operations are examples of such algorithms. 
Local varying is depicted by Figure 4.1b) in which it is shown that the output at 
point (x,y) depends on a varying size neighborhood which is normally an input 
parameter. 

(3) Global Fixed - In such algorithms each output point depends on the entire input 
image. However, the computation is normally input data independent (i.e., computa- 
tion does not vary with the type of image and only depends on the size of the 
image). Two Dimensional Discrete Fourier Transform and Histogram computation 
are examples of such algorithms. Figure 4.1c) illustrates that output at a point (x,y) 
is dependent on all the input data points. 

(4) Global Varying - Unlike global fixed algorithms, in these algorithms the amount of 
computation and communication depends on the image input as well as its size. 
That is, the output may depend on the entire image or may depend on a part of the 
image. In other words, the computation is data dependent. Hough Transform, Con- 
nected and Component Labeling are examples of such algorithms. For example, in 
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an image, a connected component may span only a small region, or in the worst 
case the entire image may be one connected component (a spiral). Similarly, in 
case of hough transform, assuming that we are looking for lines, a line may span 
across the image (meaning its votes must come from distant pixels or edges) or it 
may be localized. Figure 4. Id) shows that the output of an algorithm may depend 
on global data, and the computation is input data dependent. 


N 



N 


N 



N 


a 

N 

¥ 

c) Global Fixed 

Figure 4. 1 : Classification of Common Vision Algorithms 
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4.2. Issues in Mapping an Algorithm 

Mapping a task on one cluster implies that intratask communication will only 
involve communication between processors of the same clusters. Figure 4.2 shows how 
a parallel algorithm is mapped on a cluster. Assume that there are P processors in a clus- 
ter. As shown in Figure 4.2, first program and data are loaded onto the processor cluster. 
Both in the case of SIN® or MIMD mode, the program is broadcast onto the cluster pro- 
cessors. The data division depends on the particular algorithm. If an algorithm is mapped 
in SIMD or systolic mode, 4 then the compute and communication cycles will be inter- 
mixed. If an algorithm is mapped in MIMD mode, then each processor computes its par- 
tial results and then communicates with others to exchange or merge data. 

The total processing time in such a mapping consists of the following components. 
Program load time onto the cluster processors (t pt ), data load and partitioning time (r^), 
computation time of the divided subtasks on the processors (tcp) which is the sum of the 
maximum processing time on a processor P, and intra-cluster communication time 
( tcomm ), and the result report time (to). t<u consist of three components: 1) data read time 
from the global memory (t r ) by the cluster DSP, 2) crossbar switch setup time (r w ), and 
3) the data broadcast and distribution time onto the cluster processors (t br ). The total pro- 
cessing time x(P) of the parallel algorithm is given by 

t(P) — tpi + tji + t cp + t „ (4.1) 

where, 

t<iZ = t r + t setup tfer 

and if the computation and communication do not overlap, then 


(4.2) 
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Next Task 

Figure 4.2 : Mapping Algorithms on One Cluster 


?cp ~ if Pi) + tcomm (4.3) 

else if computation and communication can be completely overlapped, then 

t C p = MAX if Pi) * tcorrun ) (4.4) 

In the above equations, t r depends on the effective bandwidth of the global interconnec- 
tion network. 

4.3. Performance Evaluation of Parallel Algorithms 

In the following we illustrate how algorithms can be mapped in SIMD, systolic, and 
MIMD modes onto a cluster, and show how algorithms from different classes can be 
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mapped onto the cluster. In the evaluation we discuss the computation, communication 
and storage requirements for the algorithms. 

Table 4. 1 shows the parameters used for performance evaluation. These parameters 
are used for all the analysis and implementation unless specified otherwise. 

4.3.1. 2-D convolution 

2-D Convolution is a local varying type of algorithm. A 2-D convolution of an NxN 
image I(ij), Ck=i,j<=N, with a kernel W(i,j), 0<=d,j<=w 1 can be expressed as follows : 

m^j+w/2 nm+w/2 

G ( i,j ) = £ £ / {n,m)*W ((i +w/2-n) mod w, (J +w 12 -ft) mod w) 

m»y- w/2 n**i— w/2 

In other words, each point in the output is replaced by a weighted sum of a window w xw 
around it. 

The approach is to reduce 2-D convolution to a 1-D convolution with efficiency 1, 
i.e., without incurring additional steps. This mapping will illustrate how to map algo- 
rithms in SIMD and systolic modes on a processor cluster when the number of processors 
is much smaller than the problem size. Figure 4.3 shows a cluster of 64 processors. The 


Table 4.1 : Parameters for Performance Evaluation 


Total No. of Processors N„ 

512 

Cluster Size P e 

8-128 

No. of Processors/Port P„ 

4 

Image Size NxN 

512X512 

Memory Modules M 

128 

Processor Speed 

5 MIPS, 5 MFLOPS 

Network Speed (Block Transfer) 

20 Mbytes/Sec. 
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interconnection between processors shows an abstract representation of all the connec- 
tions required to perform the convolution operation. However, all the connections are not 
needed at the same time. We shall observe that only one input and one output connection 
is sufficient at any time, and that the flexibility of the crossbar can be used to obtain all 
the desired interconnections efficiently. 

Each pixel is logically mapped onto a separate processor (as if there were as many 
processors available as there are pixels). Actually the image is folded and multiple pixels 
are mapped onto one processor. The image is folded in two dimensions in a wrap around 
fashion, both left to right, and top to bottom. For a cluster size P, (assume P = pxp ), each 
processor has M = N 2 lP pixels in its local memory. In general, pixel (i,j) ; 
0<i<N-l, 0<j<N-l is mapped to processor ((i mod p), (j mod p)). Therefore, this map- 
ping preserves the adjacency of any two pixels even though the image is folded. 

Figure 4.3 shows the flow of the distribution of data for window size 5x5. A small 
window is embedded in a larger one, and therefore, the same connections can be used for 
a larger window size with the addition of new connections for extra steps. The algorithm 
performs the convolution by each processor distributing its pixel values to the neighbor- 
hood in a pipelined manner. 

In the following algorithm. North, South, East and West neighbors are defined in 
wrapped around fashion. At any step all the processors have the same neighbor connec- 
tion. Figure 4.3 shows how processor (3,3)’s values will be distributed. All the processors 
follow the same pattern. For a processor P(i,j), N,S,E,W neighbors are defined as fol- 
lows. Note that the following definition is only a logical definition, and it represents the 
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Figure 4.3 : Mapping on the Cluster for Convolution 
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pixel adjacency. The following definition does not imply any physical connections 
between processors. 

N = ((i-l)j), if (i-j) < 0, then N = ((i-1 + p), j) 

S = ((i+1) modp, j) 

E = (i, 0+1) mod p) 

W « (i, 0-1)). if 0-D < 0, then W = (i, (j-l+p)) 

The algorithm works as follows (Figure 4.4): The DSP broadcasts the convolution 
weights to all the processors. Each processor multiplies its M pixels with the central 
weight value. In Figure 4.4 the data values at each processor are stored in a linear array 
and subscript (ij) means the data value i in the connection number j. The intermediate 
values are stored in the running variable for each of the M pixels. The image is then 
shifted in a spiral manner (as shown in Figure 4.3). If the image is shifted North then the 
processors now multiply the pixel values with the South weight. This process is repeated 
w 2 -l time , i.e., for each weight. We make the following observations. First, the map- 
ping is independent of problem or cluster size. That is, this mapping will work for all 
problem sizes. Second, the number of times the interconnection needs to be changed only 
depends on the convolution kernel size. Furthermore, at any time only one input and one 
output connection is required. By storing the connection patterns in the crossbar memory 
the switching time is negligible. Third, it is possible to overlap the computation and com- 
munication by writing the pixel to the output port as soon as it is multiplied by the 
appropriate weight in the current processor. The above algorithm illustrates that SIMD 
algorithms can be mapped efficiently onto the processor clusters using the flexibility and 
programmability of the interconnection. 



ALGORITHM CONVOLUTION 
All the processors work in SIMD lock-step fashion. 

DSP broadcasts the convolution kernel. 

Set up Connection_array of size u >xw in the crossbar memory by choosing, 
first how connections from the set. 


{N£,S,S,W,W.N,N^,E,E,E,S,S,S,W,W,W,W,N,N,N,N3,..}. 
.. r iv 2 i 


For i = 1 to M do (in parallel) 


Result(i) := w i ( * data ( i ) 


EndFor 


Forj = 1 to wxw do (in parallel) 

Set up appropriate connections on the crossbar as follows. 

connection^) := connection_array(j) 

For i = 1 to M do (in parallel) 

Send data (pixels) on the output port to the connected 
neighbor. 

At the same time receive data from its input port. 
Result(i) := Result (i) + w i%J * data ( i,j ) 

EndFor 

EndFor 

END CONVOLUTION 


Figure 4.4 : 2-D Convolution 


The computation time decreases as the number of processors increases. The com- 
munication time per pixel only depends on the kernel size. The following formulae 
present the computation and communication times in terms of multiplication and addition 
operations. The factor t fl denotes the floating point speed of a processors in terms of its 
normal instruction execution speed. 
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tsw = w 2 -l 

hot ~ Mix(f C p t homm^^Tw) 

Figure 4.5 shows the performance of the 2D convolution on a processor cluster. 
The processing time has been computed assuming a 2 MFLOP processor. The Figure 
shows two speedup graphs, one with communication overlap and the other with additive 
communication. The computation time decreases linearly as the number of processors 
increases. The total communication time per processor also decreases linearly but the 
communication time per pixel computation remains constant. The important observation 
one can make is that it is essential that the communication and computation overlap in 
order to obtain linear speedups. However, if the interconnection speed is not matched 



Figure 4.5 : Performance of 2D Convolution on a Processor Cluster 
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with the computation speed, then overlap will not be possible. Having a fast crossbar 
without arbitration delays provides the necessary communication speed to obtain linear 
speedups. Note that since computation and communication can overlap, this mapping 
also illustrates how systolic algorithms can be mapped. 

4.3 2. Separable convolution 

Separable Convolution is a two-dimensional convolution broken into two one 
dimensional convolution. For applications such as computation of zero crossings, separ- 
able convolution performs well [50]. The main advantage of separable convolution is 
that the computation requirements per pixel are reduced from 2w 2 to 4w. We show how it 
can be mapped on a cluster. This example also illustrates how an algorithm can be 
mapped in MIMD mode on a cluster. 

The data is decomposed among the processors as follows. Each processor is 
assigned N/P rows of the data. Processor P, gets rows (i-l)xN/P to ixN/P - 1. Each pro- 
cessor computes convolution along the rows using a window of size w. Once processor P, 
finishes convolution along the rows, it needs rows (i-l)xN/P - w/2 to (i-l)*N/P - 1, 
from processor P t _i, and similarly, it needs the bottom w/2 rows from ixN/P to 
ixN/P +w/2 -1 from processor P i+l . Therefore, a processor needs to communicate with 
only two processors to obtain the desired intermediate data. The boundary processors P 0 
and Pp-\ only need to communicate with one other processor. Note that if the granule 
size with each processor is less than w/2 (i.e., NIP < w/2), then the processors need to 
exchange data with number of processors given below by r w . Now, each processor com- 
putes convolution along the columns in its granule. The following are computational and 
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communication requirements of the algorithm. 

fyxA/ 2 x4x(w/2+l) 



The amount of computation per pixel in separable convolution is a function of w for 
a wxw kernel unlike in 2D convolution where it is a function of w 2 . The amount of com- 
munication in separable convolution is fixed as shown in Figure 4.6. Therefore, the 
speedup is not as much as in the case of 2D convolution. There are two reasons for 
smaller speedup. First, the communication is not decomposable as a function of number 



Number of Processors 


Number of Processors 


Figure 4.6 : Performance of Separable Convolution on a Processor Cluster 
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of processors because each processor needs to exchange w/2 rows of intermediate results 
with two adjacent processors. Secondly, since the computation per pixel itself is small, 
the communication overhead as a fraction of computation time is large. 

4.3 3. Two-dimensional FFT (2D-FFT) 

2D-FFT is a Global Fixed algorithm. For an image I(k,l), 0<=k,l<=N, the 
corresponding 2D-FFT is given by 

F(m,n ) = N £ N £ I (k,l) e~ 2 *' ( * ,n+/ " yA ', 0<=m,n<=N-l 

kM) lm o 

where j = 4-T. A nice property of the 2D-FFT is that it can be performed in two 
decomposable steps : a one dimensional N point FFT along the rows followed by a one 
dimensional N point FFT of the intermediate results along the columns, or vice versa. We 
use this property to map 2D-FFT on the cluster processors. The algorithm consists of 
three phases : 1D-FFT computation along rows, transposing the intermediate results and, 
1D-FFT along the columns. 

Figure 4.7 describes the algorithm. In the first phase each processors is assigned NIP 
rows. Let’s denote the sequence of rows with processor Pi as Granule(i). Also, let’s 
divide each granule into P equal blocks of size N 2 /P 2 as shown in Figure 4.8. A block 
B(ij) denotes a block of size N 2 iP z with processor P, , 0<=j<=P-l. Each processors 
computes the ID- FFT along the rows of its granule. Then in the second phase, the pro- 
cessors communicate with each other in the following manner to transpose the intermedi- 
ate results. A processor P, sends block B(i,j) to processor Pj for all 0<=j<=P-l ,j*i. Each 
processor needs to communicate and exchange a block with every other processor in the 


ALGORITHM 2D-FFT 


/* The fol 
M := 


Each processor P, receives granule(i) of rows 


[owing description is with respect to processor P, */ 


N 2 


For k = 1 to M do 

compute 1D-FFT of row(k) of granule(i) 


For j = 1 to M do (i * j) 
k = i+j mod P 
connect P, to P t 
send Block(i,j) to P* 
receive Block(j»i) from P* 

For k = 1 to M do 

compute 1D-FFT of row(k) of granule(i) 

END 2D-FFT 

Figure 4.7 : 2D-FFT 
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Image Size : N x N 
N 



needed to transpose intermediate data. 


P 0 


Pi 


Pp-l 


Figure 4.8 : An Example of Mapping 2D-FFT onto Four Processors 


cluster. However, by performing the communication systematically, the transpose can be 
achieved without any conflicts as described in the algorithm. Finally, each processor 
computes 1D-FFT along the columns. 

The 1D-FFT for size N can be done in 0(NlogN) time [51]. The constant of multi- 
plication is 6, i.e., to perform N point 1D-FFT it takes approximately 6NlogN floating 
point operations. Therefore, the computation time for the above algorithm is (for both 
row and column) 

12xtflXN*Nxlog 2 N 
tcp= P 

The communication time to transpose the intermediate results is 

l comm = 2 x(P-\)xN 2 /P 2 

and the number of switch settings are, =P-l. 
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One important observation is that even though FFT is a Global Fixed algorithm, in 
the above mapping both the computation and communication times reduce as the number 
of processors increases. In other words, both computation and communication are 
decomposable for parallel processing. Therefore, if the communication is achieved 
without conflicts (as in our case), we can obtain linear speedups. 

Figures 4.9 and 4.10 show the performance 2D FFT on a processor cluster. From 
Figure 4.9 we can observe that almost linear speedup can be obtained. The variation of 
the communication time as a function of the processor is shown in Figure 4.10. Note that 
the communication time curve follows the computation time curve in its shape and the 
communication is completely decomposable. 



Speedup 


Figure 4.9 : Performance of 2D FFT on a Processor Cluster 
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Figure 4.10 : Communication Time for 2D FFT on a Processor Cluster 

4.3.4. Hough transform 

Hough transform is global varying algorithm. Also, the communication cannot be 
decomposed. Hough transform is a method to detect shapes such as straight lines, 
curves, circles, and ellipses in an input image [52]. The method is to perform the compu- 
tation in the parameter space of the curves. For detecting line segments, normally the 
computation is done in the (/-,0) parameter space. If there exists a line whose normal dis- 
tance from the origin is r, the normal makes an angle 0 with the x-axis then, if the point 
(x,y) lies on that line then the following equation is satisfied. 

r = xcosQ + ysinQ 

First r, 0 are quantized. The quantization depends on how much accuracy is required in 
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the final result. Assume that the maximum value of r br r m „ maximum value of 0 be 
0 m «x (generally re). Then if r r „, Q res are the resolutions used for quantization, the total 
number of accumulator cells in the computation are r m „.0 m »/r r£ ,.0 r<J , the number of 
rows and columns in the accumulator array being 0 C = 0 miuc /0 rej and p c = r max /r res , respec- 
tively. The algorithm involves two major steps. The first step is to accumulate votes in 
the accumulator array for various digitized r and 0 values. The second step is to compute 
local maxima in the output of the first step. The first step is regular and suitable for SIMD 
implementation. The second step is more suitable for MIMD implementation because the 
output is global data dependent For example, an image containing many lines will result 
in many more maxima than an image containing a few lines, and therefore, the required 
computation will vary. Hence, Hough transform is a hybrid algorithm containing both 
SIMD and MIMD algorithms. 

We present two mappings of the Hough transform algorithm for parallel processing 
on the processor cluster. The first mapping divides the input image into as many granules 
as the number of available processors. The second mapping divides the tasks depending 
on the parameters and desired quantization. The former is referred to as "data partition- 
ing" and the latter as "parameter partitioning." We discuss advantages and disadvantages 
of both the mappings and also compare the computation time, communication time and 
memory requirements for both mappings. 

Data Partitioning 

Assume that the input image is NxN, and to simplify the discussion assume that the 
number of available processors is P =p 2 . The image is partitioned into N 2 /p 2 blocks. 
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Processor P(i,j) works on block i*p + j , where \<i,j<p. Each processor computes the vote 
count for its part of the image for all quantizations of 0 values. Figure 4.11 shows the 
accumulator array for a processor. Note that each processor has to maintain a complete 
accumulator array of size p c x0 c and update the appropriate vote count computed from its 
share of the image. The algorithm ACCUMULATE_COUNT in Figure 4.12 shows the 
computation for this step. The computation time to compute the accumulator array is 


time taken to perform 2x 


xfxQ c multiplications and half as many additions, where f 



Figure 4.1 1 : Accumulator Array for Hough Transform 


ALGORITHM ACCUMULATE^ OUNT 

Each processor P,, \<d<p 2 does the following (in parallel) 

For] = 1 to 0 e do 

For each (x,y) in the subimage such that (x,y) is significant do 
/♦significant means black pixel or edge element*/ 
compute r(0 ; ) = x cos0 ; + y sin0 ; 

Accum_array(0 ; -,r(0 ; )/r ra ) = Accum_array ( 0 ; ,r(0, )/r res ) + 1 
End_For 
End_For 

END ACCUMULATE_COUNT 

Figure 4.12 : Algorithm to Compute Votes in Hough Transform 


83 


is the largest fraction of significant pixels in a block and 0 C is the number of quantizations 
for 0. The next step is to combine the partial results of all the processor to obtain a global 
accumulator array so that maxima can be determined. For combining the partial results 
we propose the tree sum method in which, at each step, twice as many processors com- 
bine their partial results, therefore requiring 2xlogp-l steps. 

The algorithm ACCUMULATE_SUM in Figure 4.13 performs the merging of par- 
tial results. The processors are numbered from 0 to p 2 - 1 . A processor with number k, 
0<k<p 2 -l corresponds to a processor (ij) such that k-i*p+j. 


I* Accum_array k (i,j) denotes the accumulator cell (ij) 
the Accumulator array of processor k. */ 


ALGORITHM ACCUMULATE_SUM 


For i = 0 2xlog 2 p-l do 

For all processors Pj do in parallel (0<j<p 2 -l) 
If j mod 2' +1 =2' then 
Connect Pj — > /> ; _ 2 « 

For k = 1 to 0 C do 
For 1 = 1 to p c do 


Send Accum_array j(k,l) Pj_y 
Accum_arrayj- 2 i (k, l) := Accum_array / _ 2 < (k, l) 
+ Accum_arrayj(k, /) 

EndFor 

EndFor 

Endjf 

EndFor 

EndFor 

END ACCUMULATE_SUM 


Figure 4.13 : Algorithm to Accumulate the Vote Count 
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Following this step, processor P 0 has the entire accumulator sum. The next step is to 
distribute this global accumulator sum to all the processors so that computation for local 
maxima can be performed in parallel. This step needs only one step. Processor Pq broad- 
casts the entire array to all the processors using the broadcast facility of the crossbar. 
After the broadcast step, each processor performs a search for local maxima on its share 


of the accumulator rows 



. In this algorithm, for each entry in its block of the accu- 


mulator array, the processor determines whether the entry represents a local maxima in a 
neighborhood. 


In summary, the total computation and communication time requirements for the 
entire hough transform algorithm using the data partitioning are as follows. 

N 2 1 


Up = 3xtyx 


2/„2 


xfxQ c + Q c xCxlogP + Q c xp c xw z lp 


where, the first term is for computing the votes, the second term is to sum the accu- 
mulator array and the third term is for looking for local maxima in a window of size w 2 . 
The communication time for this algorithm is 

Uomm = (logP+\)x$ c xp c 

and the number of switch settings are r w = logP+1. 


Unlike 2D-FFT, the communication is not decomposable. In other words, the com- 
munication increases as the number of processors increases in a cluster. In the following 
mapping we will observe that it is possible to reduce the communication such that instead 
of it increasing as a function of number of processors in the cluster, the communication 


remains constant. 
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Figure 4.14 shows the computation and communication time along with the speedup 
for hough transform. Even though the computation time for hough transform decreases as 
the number of processors increases, the computation is not completely decomposable. 
The second term (to combine partial results) of tq, increases as a log function of the 
number of processors. Furthermore, the communication overhead to combine accumula- 
tor arrays also increases logarithmically with the number of processors. Consequently, 
for a large number of processors, the communication time becomes comparable to the 
computation time (as shown in Figure 4.14), and that results in degradation in speedup. 
We will observe in the following that it is possible to obtain almost linear speedups. 


Processing Time 
(In Secs.) 



Speedup 


Figure 4.14 : Performance of Hough Transform (Data Partitioning) 
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Parameter Partitioning 

In this mapping, instead of partitioning the data to the processors the parameters 
space is partitioned. Each processor works on the entire image but computes the vote 
count for only few 0 values. Each processor computes all r values for its share of 0 
values. If there are p 2 processors, then each processor gets n = QJp 2 values of 0 to work 
on. Therefore, processor i gets to work on n values of 0, where \<i<p 2 . There are several 
advantages to this mapping, both in terms of communication and implementation at each 
processor. First of all, when looking for peaks later, a processor needs to communicate 
with only two other processors to obtain the upper and lower boundary rows of the Accu- 
mulator array. Secondly, we introduce additional data structures to make the search for 
local maxima efficient, where instead of searching for the local maxima in the entire 
accumulator array, only a fraction indicating possible local maxima need to be searched. 
Furthermore, the processor can store sin0, cos0 values for its allocated n values of 0 in its 
registers, since only few values need to be stored. This results in saving on local memory 
accesses delays which would occur if all quantized sin0 and cos0 values are stored with 
each processor in its local memory. The algorithm to compute the accumulator array at 
each processor is similar to that in the case of data partitioning, except that each proces- 
sor works on the entire image but only its own part of the parameters. 

A brief explanation of the algorithm is as follows. In the first step (computing 
votes), the algorithm computes value of p for each significant pixel for all 0 values. It 
then increments the appropriate count in the Accumulator array. If the count increases 
beyond a certain threshold value, there exists a possibility of this being a local maxima. 
Therefore, another array called the Link_array is updated marking this fact. This step 

■ ~L 
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reduces the search space when looking for local maxima since normally a very small 
fraction of the image contributes to lines and entire Accumulator array need not be 
searched when looking for local maxima. Once the above computation is finished for the 
entire image, processor P t communicates with P;+i and P t .\ to obtain the boundary rows 
of the Accumulator array. Then the local maxima are computed in the Accumulator 
array using the information available in Link_array. There is a need to search only those 
entries in the Accumulator array for a local maxima which are marked by the Link_array. 
The computation, communication and memory requirements for this mapping are as fol- 
lows. 


t cp = 3xry,x 



x/x0 c + 0 c xp c xw 2 /p 2 


where the first term is for computing the votes and the second term is to for local 
maxima in a window of size w 2 . The communication time for this algorithm is 

tcomm = 2xp c 

and the number of switch settings are - 2. 

The memory requirements of the two partitionings are comparable. For example, 
for a typical image size of 512x512, value of p c will typically be 512xV2, and C will be 
180. However, each pixel normally is a byte where as each accumulator cell is an integer. 
Assuming a 4 byte integer, in data partitioning a processor has to store the entire accumu- 
lator array of size 521 Kbytes (approximately), and in the second mapping a processor 
has to store the entire image (256 K bytes) and its part of the accumulator array. 

There is another way in which the parameter partitioning mapping can be per- 
formed. Instead of storing the image in all the processors, a controller processor, such as 
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a DSP, can store the image and broadcast each significant pixel value and its location 
while processors compute the votes in an SIMD lock-step fashion. This results in saving 
the memory, because now only one processor need store the image. We make the fol- 
lowing observations. The communication requirement is fi<N 2 , where f is the fraction of 
significant pixels. However, the communication can be overlapped with computation 
because while processors are computing the vote count for a location in the image, the 
next location can be broadcast. Therefore, the time to compute the Accumulator_array in 
this case will be MAX(f C|> , Broadcast time for fxN 2 pixels locations). 

By using parameter partitioning the overhead of combining partial results is elim- 
inated, and for each processor the communication is reduced to exchanging one row of 
the accumulator array with two other processors. Therefore, the communication remains 
constant as the number of processors increases. Figure 4.15 shows the speedup, computa- 
tion time and communication time for hough transform using parameter partitioning. Fig- 
ure 4.16 compares the communication overhead and the speedup for the two types of par- 
titioning. Notice that using parameter partitioning it is possible to obtain almost linear 
speedup. 

4.4. Parallel Implementation Results 

This section contains implementation of some algorithms on a simulated processor 
cluster. A cluster was simulated on an intel iPSC/2 hypercube multiprocessor. The per- 
formance results capture all the overheads associated with parallel programming, and 
therefore, the results are very accurate. Also, we show through the example of 2-D FFT 
algorithm that the analysis presented in the previous section is very close to the imple- 
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Processing Time 
(In Secs.) 



Speedup 


Figure 4.15 : Performance of Hough Transform (Parameter Partitioning) 
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Figure 4. 16 : Comparison of Performance of PP and DP for Hough Transform 

mentation results. We present performance results for four algorithms in this section. 
Two algorithms are 2-D FFT and separable convolution. The other two algorithms are 
parts of the Image Understanding Benchmark Algorithms developed by Weems et al [1]. 
The two algorithms are sobel edge detection and median filtering. The performance of 
the algorithms has been evaluated using the test data provided with the benchmark algo- 
rithms [1]. 

4.4.1. 2-D FFT 

A mapping of 2-D FFT has been described in Section 4.1. Figure 4.17 shows the 
performance of 2-D FFT on a 16 processor cluster (image size 256x256). Other parame- 
ters are the same as given in Table 4.1. Solid lines in the graph show the computation 
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times for analysis (symbol +) and implementation. We observe that the analytical results 
are very accurate. However, the implementation times are a little more than that given by 
analysis because implementation captures the overhead of index management, etc. which 
is not included in the analysis. The graph also shows the corresponding speedups for 
both cases. Note that speedups obtained through analysis and implementation are almost 
the same and are practically indistinguishable. Figure 4.18 shows graphs for the com- 
munication time. Again, implementation and analytical results are very close to each 
other. 


Processing Time 
(In Secs.) 



Speedup 


Figure 4. 17 : Performance of 2-D FFT on a Cluster (Analysis and Implementation) 
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Figure 4.18 : Communication Time for 2-D FFT 

4.4.2. Separable convolution 

Table 4.2 shows the performance for separable convolution implementation on a 
256x256 image with window size 10x10. The table shows the major computation opera- 
tions in the algorithm which include floating point operations as well as integer opera- 
tions. The fifth column shows the number of times connection in the crossbar needs to be 
changed during the algorithm execution, and column 6 contains the rounded value of the 
amount of data communicated in KBytes. The table shows that the communication time 
is very small compared to the computation time, and therefore, good speedups are 



obtained. 
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Table 4.2 : Separable Convolution Implementation Results 


Separable Convolution 


Window 10x10 

No. 

Proc. 

FI. Point 
K. Ops 

Other 
K. Ops 

Comp. 
Time (ms.) 

Comm. 

Setup 

Comm. 
K Bytes 

Comm. 

Tlme(ms.) 

1 

3932 

3932 

2607 

0 

0 

0 

2 

1966 

1966 

1310 

2 

20 

4.09 

4 

983 

983 

658 

3 

20 

4.09 

8 

492 

492 

332 

3 

20 

4.09 

16 

246 

246 

169 

3 

20 

4.09 


4.4 3. Benchmark Algorithms 

The Image Understanding Benchmark provided the serial version of the programs 
and the data [1]. We implemented sobel edge detection and median filtering algorithms. 


4.4.3.I. Sobel 

Sobel edge detection is a two-dimensional convolution operation with a 3x3 mask. 
The implementation was done using medium grain parallelism in an MIMD mode, and 
mapping was similar to that of separable convolution. Table 4.3 illustrates the perfor- 
mance results for sobel edge detection algorithm. There were six data sets but here we 
present results using only one data set (test, size 256x256). The results obtained on other 
data sets were similar. The table includes all overheads, including program load time, 
data load time, data input time (from global memory), and time to gather results. If all the 
overhead is included, then the performance for larger cluster size is sublinear. There are 
two main reasons for this performance. First, amount of computation per pixel is very 
small (3x3 convolution), and secondly, all the overhead is included in the computation of 
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the speedup. The parameters for communication bandwidth are conservative (20 
MBytes/sec.) and if the bandwidth is assumed to be larger, then the performance is 
expected to be much better. 

4.4.3.2. Median filtering 

Table 4.4 shows the performance results for the median filtering algorithm. The 
algorithm was evaluated on the same data set. Size of the median filter was 5x5. Data is 
partitioned along the rows. Each processor is allocated an equal number of rows and two 
boundary rows in each direction. There is no need for communication during the algo- 
rithm execution. Median filtering does not involve any floating point multiplication or 
addition operations (only comparison operations are needed). Table 4.4 shows that we 
can obtain good speedups on a cluster for median filtering. 


4.5. Summary 

To evaluate parallel algorithms on a cluster, we explored alternative mapping stra- 
tegies and computation modes. Some of the algorithms have been implemented on a 


Table 4.3 : Sobel Edge Detection 


Sobel (Test) 

No. Proc. 

Proc. 

Data load 

Result Output 

Prog. Load 

Data Input 

Total 

Speed up 


Tim e< sec.) 

Time(Sec.) 

Tim e( sec.) 

Tlme(sec.) 

Tlme(sec.) 

Tim e( sec.) 










1 

4.04 

0 

0 

0 


4.05 

1 

2 

2.02 


0.014 

0.001 


2.1 

1.92 

4 

1.01 


0.014 

0.001 

0.008 

1.09 

3.70 

8 

0.51 

0.056 

0.014 

0.001 

0.008 

0.589 

6.91 

16 

0,26 

0.056 

0.014 

0.001 

0.008 

0.33 

12.13 

32 

0.13 

0.056 

0.014 

0.001 

0.008 

0.21 

19.71 




















95 


Table 4.4 : Median Filtering 


Median Filtering (Test) 

No. Proc. 

Proc. 

Data load 

Result Output 

Prog. Load 

Data Input 

Total 

Speed up 


Time(sec.) 

Time(Sec.) 

Tim e{ sec.) 

Time(sec.) 

Tlme(sec.) 

Time(sec.) 










1 

6036 

0 

0 

0 


60.37 

1 

2 

30.17 


0.056 

0.001 


30.30 

1.99 

4 

15.19 


0.056 

0.001 

0.008 

15.31 

3.94 

8 

7.72 

0.056 

0.056 

0.001 

-0.008 

7.85 

7.70 

16 

3.99 

0.056 

0.056 

0.001 

0.008 

4.11 

14.68 

32 

1.90 

0.056 

0.056 

0.001 

0.008 

2.02 

29.93 


simulated cluster, and we show that the analysis provides very accurate results. The per- 
formance results show that very good speedups can be obtained on a processor cluster in 
any computation mode. The parameters chosen for processor speed and communication 
speed were very conservative. We think that much faster processors and communication 
links are possible and available with today’s technology, and therefore, the performance 
results presented in this chapter are also conservative. 
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CHAPTER 5 . 

INTER-CLUSTER COMMUNICATION IN NETRA 


The focus of this chapter is inter-cluster communication in NETRA and perfor- 
mance evaluation of parallel algorithms when mapped across multiple clusters. When an 
algorithm is mapped on multiple clusters, processors belonging to different clusters may 
need to communicate, and therefore, inter-cluster communication is needed. However, 
unlike intra-cluster communication between processors, there may be conflicts in access- 
ing the global interconnection network, global memory or other common resources. 
These conflicts need to be taken into account when computing the inter-cluster communi- 
cation, and consequently, performance of algorithms when mapped across multiple clus- 
ters is affected by conflicts. In this chapter we present a method to evaluate inter-cluster 
communication time under the presence of conflicts. The method is based on the work by 
Patel [53,54]. 

This chapter is organized as follows. Section 5.1 presents alternative inter-cluster 
communication strategies in NETRA. Analysis of inter-cluster communication strategies 
is presented in Section 5.2. How the analysis can be incorporated into the performance 
evaluation of algorithms is the subject of Section 5.3. Section 5.4 contains the perfor- 
mance evaluation of various algorithms whose performance on one cluster was presented 
in the preceding chapter. Finally, Section 5.5 summarizes the chapter. 
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5.1. Alternatives for Inter-cluster Communication 

5.1.1. Multistage interconnection network and global memory 

In this method global memory is used for inter-cluster communication. The global 
memory is accessed through the multistage interconnection network by processors in a 
cluster or by a DSP. A processor(s) needing to send data to another processor in a dif- 
ferent cluster writes the data into designated locations in the memory. This involves set- 
ting the appropriate circuit through the global multistage interconnection network to the 
memory module followed by a data transfer. The data is transferred in block mode. The 
Memory Line Controller (MLC) updates the information about the destination port(s), 
length of the data block, and block’s starting address, and sets a flag indicating the avai- 
lability of data. Now the destination processor can read the data using this information. 
Note that this method permits out of order requests to be serviced. For example, if the 
destination processor tries to read the data before it has been written, the MLC informs 
the processor of this situation, and when the data is really written into the global memory 
then the MLC informs the destination processor. This is a block level data-flow approach. 
The main advantages of this approach are that asynchronous communication is possible, 
out of order messages can be handled, and efficient pipelining of data can be achieved. 
This method is depicted in Figure 5.1. The Figure shows how a processor P t of cluster C, 
will communicate with processor P; of cluster Cj using the strategy. 

5.1.2. DSP tree links 

The second alternative to performing inter-cluster communication is to use the DSP 
tree links. However, for distant inter-cluster communications, the tree may not perform 
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Figure 5.1 : Inter-cluster Communication Using Global Memory 


well because of the root bottlenecks typical to any tree structure. The main function of 
the tree structure is to provide control hierarchy for the clusters. Its links are mainly used 
for program and data broadcast to subtrees, and DSPs use the tree links to send (receive) 
control information to (from) other DSPs. Therefore, the DSP tree is designated mainly 
for control function, and we do not envision it to be used for large data transfers between 
distant processors. 

5.1.3. Global bus 

The third alternative strategy to perform inter-cluster communication is to use a 
high speed global bus that connects all DSPs and one port from each cluster. The global 
memory is also connected to the bus and is accessible to all clusters via the bus. Note that 
the global bus is proposed to be an alternative global interconnection to the multistage 
interconnection network. If the bus can be designed fast enough (such as by using fiber 
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optics), and if inter-cluster communication is low, the global bus presents a viable alter- 
native to the multistage interconnection network. In this scheme, the communication is 
done explicitly by messages, and it is synchronous. Figure 5.2 show this communication 
method. The Figure shows how a processor P t of cluster C, will communicate with pro- 
cessor Pj of cluster Cj using the strategy. 

5.2. Analysis of Inter-cluster Communication 

Inter-cluster communication is needed in the following cases : i) An algorithm is 
mapped in parallel on more than one cluster and the processors need to communicate to 
exchange partial results or combine their results, ii) In an integrated vision system, output 
data of a task produced at one or more clusters needs to be transferred to the next task 
executing on different clusters, iii) It is needed to perform input and output of data and 


GLOBAL BUS 



Figure 5.2 : Inter-Cluster Communication Using the Global Bus 
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results. The amount of inter-cluster communication depends on the type of algorithms, 
how they are mapped in parallel, frequency of communication and amount of data to be 
communicated. 

There are several parameters that affect the inter-cluster communication time. The 
architecture dependent parameters are the number of processors (i.e., number of clusters 
and number of processors in each cluster), number of memory modules, number of pro- 
cessors per port connected to the global interconnection, and the type of interconnection 
network. Some parameters depend both on the architecture as well as on the type of algo- 
rithms, how they are mapped, and their communication requirements when mapped onto 
multiple clusters. Furthermore, not only does the communication time depend on the 
underlying algorithms but it also depends on the network traffic generated by other pro- 
cessors in the system because there may be conflicts in accessing the network as well as 
memory modules. 

We consider an equivalent model of the architecture as shown in Figure 5.3. The 
model shows N processors connected to M memory modules through a global intercon- 
nection network. N is given by C xP t + N dsp , where C is the number of clusters, P, is the 
number of ports in each cluster and is the number of DSPs in the system. For simpli- 
city, we assume that each cluster contains an equal number of processors. The number of 
physical processors will be given by C xP, xP p , where P p is the number of processors per 
port. 

The following analysis is based on the analysis presented by Patel in [53,54]. He 
developed an analytical model for evaluating alternative processor memory interconnec- 
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Figure 5.3 : Equivalent Model for Global Communication 


tion performance and showed that the analysis is reasonably accurate. Consider execu- 
tion of a typical parallel algorithm on multiple clusters. The execution will consist of pro- 
cessing, intra-cluster and inter-cluster communication. Figure 5.4 shows the computation 
and communication phases of an algorithm. The computation time is given by t cp , the 
intra-cluster communication time is given by t ch and the inter-cluster communication 
time is given by tid in terms of equivalent processor cycles. However, due to conflicts in 


tid t C p + t c l t lc l 



With Interference 

Figure 5.4 : Computation and Communication Activities of a Processor 
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the network or conflicts in accessing memory modules, a processor may have to wait for 
w a cycles before being able to access the network and write to (or read from) the 
memory. In effect, this can be seen as the communication time being elongated by a fac- 
tor w for each request, and instead of communication time being t^, it is now wxr^ as 
shown in Figure 5.4. Therefore, if the probability of accessing the global network in each 
processing cycle is m and for each access the communication time is t ic i, then the useful 
computation for t processor cycles takes t + mxtxwxt^, where t = +t el . The fraction of 
useful work (utilization U) is given by 

U t + mxtxwxtid ' (5-1) 

The average number of busy memory modules (or fraction of time when the bus is busy 

when the global interconnection is a bus) is 


Nxmxtictxt 
t + mxtitfXtxw 

and in terms of utilization. 


(5.2) 


B - NxmxticixU. (5.3) 

In [53], it is shown that the utilization primarily depends on the product mxr k/ rather 
than m and individually. In other words, the processor utilization primarily depends on 
the traffic intensity and to a lesser extent on the nature of the traffic. 

For a particular algorithm, all the parameters are known except w. The probability 
of accessing the global network is essentially given by the number of times communica- 
tion is needed per processor cycle and is known when an algorithm is mapped in parallel. 
The factor w depends on the algorithm parameters as well as the interference from other 
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processors accessing the global network and the memory, number of processors, number 
of memory modules, the type of interconnection network and the access rate m. 

Consider the processor activities again. A processor needing to access the global 
memory or the bus submits requests again and again until accepted; on an average this 
happens for (vv-l)xf w time units. After the request is granted, the processor has a path to 
memory for time units. In other words, the network sees an average of wxr lc/ consecu- 
tive requests for unit service time. Therefore, the request rate (for unit service) from a 
processor as seen by the network is 

mxwxtjd 

m = . (5.4) 

1 + mxwxtid 

and in terms of utilization 
m' = 1 - {/. 

For details, the reader is referred to [53]. 

The model that we analyze is a system of N sources and M destinations. Each 
source generates a request with probability m' in each unit time. The request is indepen- 
dent, random, and uniformly distributed over all destinations. Each request is for one unit 
service time. The following is an analysis for a bus and for multistage delta network. 

Bus : We know from earlier discussion (Equation 5.3) that 

B =Nxmxt icl xU (5.5) 

and also, assuming all sources have the same request rate, average amount of time the 

bus is busy is given by 


B =[ 1 -( l-m'f ]. 


(5.6) 
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Equations 5.4 and 5.5 result in a non-linear equation 

Nxmxt^xU - [ 1 - ( 1 - m' f = 0. (5.7) 

In the above equation, value of m' can be substituted in terms of w, and hence, value 

of w can be computed. If the request rate from sources is not uniform, i.e., if the request 
rate from source N ,• is m, then the above equation becomes 

jmN j-N 

y. (wiiXt^/ (j)xU y) — [ 1 — J^(l — /n y ) ] = 0. (5.8) 

» j-i 

When evaluating performance of a parallel algorithm mapped across clusters there 
will be two request rates, one for the processors taking part in executing the algorithm 
and the other for the rest of the processors in the system which will be an input parame- 
ter. 

Multistage-Interconnection (Delta) : A delta network is an n stage network constructed 
from axb crossbar switches with a resulting size of a n xb H . Therefore, N = a n and M =b n . 
For a complete description refer to [54]. Functionally, a delta network is an interconnec- 
tion network which allows any of N sources (processors) to communicate with any one of 
the M destinations (memory modules). However, two requests may collide in the net- 
work even if the requests are made to different memory modules. We use results 
from[53, 54] to obtain the average number of busy main memory modules B, which is 
given by 

B =Mxm, (5.9) 

and the following equation in satisfied. 


Nxmxt^xU - Mxm n = 0 


(5.10) 
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where, 

m; „ 

m 1+1 = 1 - ( 1 — — ) , 0<i <n 

D 

and, m 0 = 1 - 1/. 

For details, the reader is referred to [53, 54]. 

These equations are solved numerically to obtain the interference delay factor w 
which is used in the performance evaluation of algorithms mapped across multiple clus- 
ters. 


5.3. Approach to Performance Evaluation of Algorithms 

Performance of an algorithm mapped on" multiple clusters is governed by various 
factors. Table 5.1 summarizes the parameters affecting the performance of a parallel 
algorithm. The approach to evaluating the performance of an algorithm is as follows. 
Using the parameters and a particular mapping, computation (t cp ), intra-cluster communi- 
cation ( t c i ) and inter-cluster communication time are determined. The traffic inten- 
sity for a processor(s) (or a cluster depending on how an algorithm is mapped) is given 

by — — — . Using the traffic intensity values, and using a range of traffic intensity values 

tcp+tcl 

for interference, the effective bandwidth of the network is determined; that is, the factor 
w is computed. 

Consider a parallel execution of an algorithm across clusters. If the execution time 
when the algorithm is executed on a single processor is t seq , then the speedup in the best 
case is given by 
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Table 5.1 : Parameters for Performance Evaluation 


Total No. of Processors N„ 

512 

Cluster Size P c 

8-128 

No. of Processors/Port P„ 

4 

Image Size NxN 

512X512 

Memory Modules M 

128 

Processor Speed 

5 MIPS, 5 MFLOPS 

Network Speed (Block Transfer) 

20 Mbytes/Sec. 

Traffic Intensity for 
Interference (mxr) 

0.1.0.4.0.8 


Sp = 


*stq 


?cp ^cl ticl 


(5.11) 


that is, assuming there is no interference while accessing the network or the global 
memory. Under the conditions in which there are conflicts while accessing the network, 
the inter-cluster communication time is given by wxt^, and therefore, the speedup is 
given by 


Sp' = 


l seq 


tcp + td+WXtici 


(5.12) 


Hence, degradation in speedup with respect to the best case speedup will be 


Sp-Sp' _ (w-l)xr^ 

Sp tcp + tc+WXtu;,' 


5.4. Performance of Parallel Algorithms on Multiple Clusters 

The extent of inter-cluster communication depends on the type of algorithms, how 
they are mapped in parallel, frequency of communication, and amount of data to be com- 
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municated. As discussed in the previous chapter, these requirements vary for algorithms 
belonging to different classes. 

We are mainly interested in the performance evaluation of parallel algorithms when 
mapped across clusters. The performance of an algorithm will be affected by interference 
from other processors in the system which are not executing the particular algorithm 
under study. 

This section discusses the performance of various algorithms when mapped across 
clusters. The algorithms are selected according to their communication requirements. We 
have chosen one algorithm from each of the following categories: Local Varying, Global 
Fixed and Global Varying. Algorithms in each of these categories exhibit different com- 
munication characteristics, and therefore, the analysis will provide the performance of 
the architecture for a wide range of algorithms. 

5.4.1. Two-dimensional Fast Fourier Transform (2-D FFT) 

From Chapter 4 we know that a 2-D FFT can be performed in two steps : a one- 
dimensional N point FFT along the rows followed by a one-dimensional N point FFT of 
the intermediate results along the columns, or vice versa. We use this property to map the 
algorithm across clusters. Hence, dividing the data along rows will not require commun- 
ication when computing one-dimensional FFT. However, communication is needed to 
obtain transpose of the intermediate results. Figure 5.5 shows an example of the two 
steps and communication for three clusters. 

Clusters are allocated rows in proportion to their size. A cluster C, of size P c (i) (i.e., 

. . NxP c (i) 

containing P c (i ) processors) is allocated , where n is the total number of clusters 

5*c(o 

i»l 
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(a) Row EFT 
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(c) Column FFT 


Figure 5.5 : An Example of Mapping 2-D FFT on Three Clusters 


executing the algorithm. Within a cluster rows are equally divided among processors. In 
the first phase processors compute N point FFT of all the rows in their granule. In the 
second phase, to obtain transpose of the intermediate data, processors write the inter- 
mediate results into the designated global memory locations, which is read by other pro- 
cessors. Data remaining within a cluster is transposed using the cluster crossbar. 

The computation time in terms of number of instructions is given by the following. 
The total number of processors are given by P, and we assume all clusters have the same 
size (P c ). 

l2xN 2 log 2 (N)xt fl 

*cp p (5.14) 

where, t fl is the number of instructions per floating point operation. The intra-cluster 
communication time ( t cl ) and the inter-cluster communication time are given by 
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tcl = 


2x N 2 (P c -\) 

~P- 

,2 


hd — 


4xN z x(rt-l)xP p x/? 

nxp 


(5.15) 


(5.16) 


where P p is the number of processors per port and R is the communication speed of the 
network in terms of number of instructions per word transfer. 


Using these parameters for 2-D FFT traffic intensity, computation times and param- 
eters from Table 5.1, we evaluate the performance using the analysis presented earlier. 
Figure 5.6 shows the speedup obtained for the 2-D FFT algorithm. The X-axis shows the 
number of processors (cluster size is 16). For example, value 48 means that the algorithm 
is executed on 3 clusters, each containing 16 processors. The four different graphs in the 
Figure show speedups for no conflict (best case), low conflict, medium conflict and high 



Figure 5.6 : Speedup for 2-D FFT (Multistage Network) 
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conflict cases through the global interconnection network (multistage interconnection). 
Similar results are presented later in this section showing when a bus is used as the global 
interconnection network. It is observed that speedup obtained under varying degrees of 
conflicts through the network is comparable to that obtained in the best case. However, 
the best case speedup itself is not linear because of the delays through the network and 
the global memory. 

Figure 5.7 shows the computation and communication time for 2-D FFT as a func- 
tion of number of processors. Figure 5.8 shows a blown-up graph for the communication 
times. As we observe, the communication time is much smaller than the computation 
time. Furthermore, the communication time also decreases as the number of processors 
(clusters) increases. Also note that the intra-cluster communication time is much smaller 



Figure 5.7 : Computation and Communication Times for 2-D FFT (Multistage Network) 
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Figure 5.8 : Communication Times for 2-D FFT (Multistage Network) 

than the inter-cluster communication time. Figure 5.9 shows percentage degradation in 
speedup, as defined in Equation (5.13), for different levels of conflict in the network. 
The degradation in the speedup levels off after increasing initially because the communi- 
cation time decreases as the number of processors increases. 

Figure 5.10 shows the sensitivity of the speedup to the network bandwidth. The net- 
work bandwidth is normalized to computation speed. For example, value 1 on the X-axis 
means that it takes the same amount of time (amortized or block in block transfer mode) 
to write/read a word to/from global memory as it takes to execute one instruction. The 
region on the left of 1 indicates faster communication network and to the right of 1 indi- 
cates slower communication network. It is evident from the Figure that degradation in 
speedup occurs very fast as the communication becomes slower. Therefore, in order to 
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Figure 5.9 : Degradation in Speedup Due to Conflicts for 2-D FFT (Multistage Network) 
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Data Transfer Time/Word 
(In Terms of No. of Instructions) 

Figure 5.10 : Speedup vs. Network Speed 

obtain any significant speedups from parallel computation, it is important to have 
matched computation and communication speeds; otherwise, increasing the number of 
processors or the processor speeds will not improve the performance as expected. The 
Figure also illustrates that the four graphs in the Figure diverge as the communication 
becomes slower meaning that slower performance under heavy traffic suffers more in the 
slower network than under light traffic. 

The following is a discussion of the performance of 2D-FFT when a bus is used as a 
global interconnection network. The algorithm is mapped as described above. Since glo- 
bal bus can be accessed by only one processors at a time, the inter-cluster communication 
time becomes additive as the number of clusters is increased. Therefore, the performance 
is expected to be worse than that in the case of the multistage interconnection network. 
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The total computation time remains the same as in the previous case and is given by 
UxN 2 \og 2 (N)xt fl 

‘cp p • ( 

However, the inter-cluster communication time becomes 


*^-‘2 xRxN 2 2xRx(n-\)xN 2 

tid ~ 2 ~~ 2 ' (3.1o) 

i- 1 * * 

/._n 

In other words, each cluster needs to send - fraction of its data to transpose the 

n 

intermediate results. This is achieved by a designated processor in each cluster, which 
collects the data and broadcasts it on the bus to be read by other cluster processors. 
Hence, there is an additional overhead of collecting and distributing the intermediate 
data. The intra-cluster communication time in this case is given by 


*cl = tcl\ + tell + l cl3 

where. 


2x(P — 1 )xN 2 

for within cluster transpose, t ct \ = ^ , and, 

P£xn 


for sending, receiving and redistributing the intermediate data, t d 2 = tj 3 = 


2xN 2 x(rt-l) 


Using these parameters, we evaluate the performance of 2-D FFT under varying 
degrees of conflicts on the bus. Figure 5.11 shows the speedup for 2-D FFT as a function 
of the number of processors (cluster size 16). When there is no conflict on the bus, the 
speedup increases with the number of processors. However, under conflicts, the speedup 
first decreases and then increases slowly. In fact, for medium and high conflicts, the 
speedup obtained on one cluster is better than that obtained using multiple clusters, the 
reason for such poor performance is that even though the communication is decompos- 
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Number of Processors (Cluster Size 16) 

Figure 5.11 : Speedup for 2-D FFT (Global Bus) 

able in 2-D FFT, the inter-cluster becomes communication time additive due to the bus 
and increases as the number of clusters executing the algorithm increases as shown in 
Figure 5.12. It is evident from the Figure that the computation time decreases but the 
communication time increases and becomes more than the computation time. 

Figure 5.13 shows the relative performance degradation in the speedup. The degra- 
dation is very significant. However, the degradation itself decreases as the number of pro- 
cessors (clusters) increases because more clusters execute the algorithm, and conse- 
quently, less number of clusters interfere. Figure 5.14 shows the sensitivity of the 
speedup to the bus speed. Again, the Figure shows that performance degrades rapidly as 
the bus becomes slower. In order for a bus to be viable global interconnection network it 
is essential that the bus bandwidth be much greater than the processor speed. 
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Figure 5.12 : Computation and Communication Times for 2-D FFT (Global Bus) 
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Figure 5.13 : Degradation in Speedup Due to Conflicts 2-D FFT (Global Bus) 
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Figure 5.14 : Speedup vs. Network Speed (Global Bus) 

5.4.2. 2-D separable convolution 

This algorithm consists of two steps. First convolution along rows using two one- 
dimensional masks and then convolution along columns of the intermediate results. Par- 
titioning along rows in clusters, therefore, avoids communication in the first step. How- 
ever, before the second step can be performed, boundary rows with each cluster need to 
be communicated to other clusters. Figure 5.15 shows the mapping on three clusters. 
Note that unlike in 2-D FFT, a cluster needs to communicate with at most two other clus- 
ters to obtain the upper and lower boundary rows of the intermediate results. The number 
of rows to be exchanged depends on the kernel size. For a kernel size of wxw, the 

number of rows to be exchanged along each direction is — . The amount of communica- 
tion is fixed and is independent of the number of clusters on which the algorithm is 
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mapped. The same mapping will work for regular 2-D convolution except that the 
amount of computation per pixel will be larger. 

The computation time for the two steps is given by 

2xtflXNx(—+l) ( 5 . 19 ) 

*cp 1 — tcp2 — p 

the intra-cluster communication is given by 

t cl = 2xA r xw, (5.20) 

and the inter-cluster communication is given by 

tid = IxwxNxR. (5.21) 

Figure 5.16 depicts the speedup obtained for the 2-D convolution algorithm as a 

function of the number of clusters (cluster size = 16). The speedup increases sublinearly 
as the number of clusters increases. The reason for not obtaining better speedup is that 
the computation per point of the input is small, the computation per processor decreases 



(a) Initial Mapping <■>> Boundary Rows 

along Rows into Global Memory 
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Figure 5.15 : An Example of Mapping 2-D Separable Convolution on Three Clusters 
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as the number of clusters increases, but the communication remains constant (as long as 

w 

the granularity per processor is at least — rows). Hence, the ratio of computation and 

communication decreases as the number of processors increases. The computation and 
communication times are shown in Figures 5.17 (a) and (b). Figure 5.16 compares the 
two times whereas Figure 5.17 shows only the communication time. 

Note that inter-cluster communication can be avoided completely if clusters are 
assigned overlapped rows to perform the first step. That is, if a cluster is responsible to 
compute 2-D convolution for R t rows, then its is assigned w + /?, rows. Therefore, each 
cluster has to perform additional computation to obtain 1-D convolution of w additional 
rows. If the extra computation time is less than the communication time then overlapped 
data partitioning is better. 



Figure 5.16 : Speedup for 2-D Convolution (Multistage Network) 
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Figure 5.17 : (a) Computation and Communication Time (b) Communication Times 

2-D Convolution (Multistage Network) 

Figure 5.18 shows a performance comparison of the two partitioning methods. 
When the number of processors executing an algorithm is small, the performance is 
almost the same. For smaller window sizes the difference is marginal and becomes 
apparent only when the number of processors becomes large. However, as the window 
size increases (40x40 in Figure 5.18 ), the performance with overlapped computation 
becomes poor because the overhead of extra computation becomes larger than the com- 
munication overhead. 

Figure 5.19 shows the performance of the algorithm when the bus is used as a glo- 
bal interconnection network. The speedup increases as the number of clusters increases 
but eventually levels off. Though inter-cluster communication time per cluster is con- 
stant, total communication time increases as the number of clusters increases, because 
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The box in the left graph has been blown up in the right graph 

Figure 5.18 : Overlapped Computation vs. Communication Trade-off 
(2-D Separable Convolution) 


only one cluster can send data on the bus at any time. This is illustrated in Figure 5.20 
where the communication time (with no interference) is a linear function of the number 
of clusters. Another reason for speedup to level off is that for a larger number of clusters 
the computation time becomes comparable or smaller than the communication time. 
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Speedup 



Figure 5.19 : Speedup for 2-D Convolution (Global Bus) 
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Figure 5.20 : Computation and Communication Times for 2-D Convolution (Global Bus) 
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5.4 J. Hough transform 

We have evaluated two mappings for hough transform, namely, Data Partitioning 
(DP) and Parameter Partitioning (PP). The difference between the two mappings is 
described in Chapter 4. Briefly, in DP, data is decomposed among clusters and, in PP, 
parameters are decomposed across clusters. 

Data Partitioning 

Data is allocated to clusters in proportion to their size. Within a cluster data is distri- 
buted equally among the processors. The algorithm consists of three phases. In the first 
phase, each processor computes and accumulates the count contributed by its data for all 
the parameter values. Note that each processor maintains the entire accumulator array. In 
the second phase, partial results are combined within a cluster, i.e., all the accumulator 
arrays are added together, and then a designated processor from each cluster writes the 
accumulator array to designated memory locations. Arrays from all the clusters partici- 
pating in the algorithm execution are then collected by one cluster. In the third phase, the 
cluster having the entire accumulator array computes the local maxima. 

Parameter Partitioning 

Under this scheme, each cluster is assigned the entire input data but is assigned only 
a part of the parameter space. The parameter space is partitioned in proportion to the 
cluster size. Each cluster receives two more parameters (boundary values) so that inter- 
cluster communication is avoided. That is, each cluster performs a fixed amount of addi- 
tional computation to avoid communication. Within a cluster, however, data is distri- 
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buted equally among the processors, and all processors work on the entire allocated 
parameter space. Dividing the parameter space results in mutually exclusive accumulator 
arrays with processors, and therefore, to compute local maxima, there is no need for 
inter-cluster communication. 

For DP, the computation and communication times for various phases are as fol- 
lows: t cpX is for computing accumulator count, t cp2 is for combining partial accumulator 
arrays within a cluster, r cp3 is for computing the final accumulator array, and t cp4 gives 
the time to compute the local maxima by one cluster. 


3xfj,xW 2 xe c 
tcp 1 ” p 

(5.22) 

— 

*cp2 - PcX0cXlOg 2 / > c 

(5.23) 


(n— l)xp c x 0 c 

{ cp% ~ p<; 

(5.24) 


3xp c x8 c 

tc p4 ~ p 

r C 

(5.25) 


Intra-cluster and inter-cluster communication times are give by 


— 

t c i = Vog 2 P c + l)xp c xQ c 

(5.26) 


nxRxP p xQ c x p c 

^ cl ~ p » 

r c 

(5.27) 


Similarly, the corresponding computation and communication times for PP are given by 


9 0 c 

3xt*xN 2 x(— + 2) 
t n 
Cpl ~ Pc 

(5.28) 

— 


0 C 

^ 2 =log 2 / > c xp c x(— +2) 


(5.29) 
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3xp c xe c 

' cp3= nxP c 

0c 

tel = (Jog 2 ^ c + 1)X( + 2)Xp c . 

n 


(5.30) 

(5.31) 


Figure 5.21 depicts the speedups for hough transform using the two partitioning 
methods. Due to the communication overhead through global memory, which increases 
linearly with the number of clusters, the speedup for DP levels off. Figure 5.22 shows 


Speedup 



Figure 5.21 : Speedup for Hough Transform (Multistage Network) 



126 


the computation and communication times for hough transform, whereas Figure 5.23 
shows the communication overhead for hough transform in detail. Data partitioning does 
not perform as well as parameter partitioning. However, degradation with respect to best 
case speedup in DP is small. As we can observe, good speedup can be obtained for a glo- 
bal data dependent algorithm like hough transform. Figure 5.22 and 5.23 illustrate the 
computation and communication times for the DP case. 

Figure 5.24 shows the speedup for hough transform (DP) and Figure 5.25 depicts 
the communication and computation times, respectively when the bus is used as a global 
interconnection network. Note that performance of the hough transform under PP will be 
the same in both cases because there is no global communication. 



Figure 5.22 : Computation and Communication Times for Hough Transform 




Figure 5.23 : Communication Times for Hough Transform 
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Figure 5.24 : Speedup for Hough Transform (Global Bus) 
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Figure 5.25 : Computation and Communication Times Hough Transform (Global Bus) 
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5.5. Summary 

In this chapter we presented inter-cluster communication strategies in NETRA. 
Inter-cluster communication is needed when algorithms are mapped across clusters to 
transfer data between tasks executing on different cluster processors, or to input and out- 
put data. Several factors contribute to the performance of algorithms mapped across clus- 
ters. Not only does an algorithm’s computation and communication characteristics con- 
tribute to its performance, but the system load, interference in accessing global intercon- 
nection network and global memory, and network bandwidth also contribute to the per- 
formance. We presented an analysis of how the effect of conflicts in global network and 
memory can be incorporated into the performance evaluation of parallel algorithms 
mapped across clusters. For each algorithm we presented one or more mapping stra- 
tegies, its performance evaluation and a discussion of the results. 

The performance results were used to compare alternative inter-cluster communica- 
tion strategies, and they show that it is possible to obtain good performance for algo- 
rithms with different characteristics under varying degrees of conflicts in a global inter- 
connection network. In general, a multistage interconnection network as the global inter- 
connection performs much better than a global bus, as expected. The parameters chosen 
for processor speed and communication speed were very conservative. We think that 
much faster processors and communication links are possible and available with today’s 
technology, and therefore, the performance results presented in this chapter are also con- 
servative. However, we obtained insight into the sensitivity of the performance measures 
as a function of various architecture parameters. 
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CHAPTER 6. 

DATA DECOMPOSITION AND LOAD BALANCING TECHNIQUES 


As discussed in Chapters 1 and 2, IVSs employ a sequence of image understanding 
algorithms in which the output of an algorithm is the input of the next algorithm in the 
sequence. Algorithms that constitute an integrated vision systems exhibit different 
characteristics, and therefore, require different data decomposition techniques and 
efficient load balancing techniques for parallel implementation. Since the input data of a 
task is produced as the output data of the previous task, this information can be exploited 
to perform knowledge based data decomposition and load balancing. 

This chapter presents several techniques to perform static and dynamic load balanc- 
ing schemes for IVSs. These techniques are novel in the sense that they capture the com- 
putational requirements of a task by examining the data when it is produced and using 
the knowledge of the computation in the next step. Furthermore, they can be applied to 
many integrated vision systems because many algorithms in different systems are either 
the same or have similar computational characteristics. These techniques are evaluated 
by applying them to the algorithms in a motion estimation system. It is shown that the 
performance gains when these techniques are used is significant, and the overhead of 
using these techniques is minimal. The evaluation is performed by implementing the 
algorithms on the hypercube multiprocessor system. The rationale behind using a com- 
mercially available machine is to capture all the overheads in implementations. 
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Data decomposition and load balancing techniques presented in this chapter are for 
medium to large grain parallelism. Two important characteristics of these techniques are 
that they are general enough to apply to any integrated system, and that they use statistics 
and knowledge from the execution of a task to perform load balancing and scheduling for 
the next task in the system. For example, in the motion estimation system sufficient 
knowledge can be obtained about the output data from the zero crossing step to perform 
efficient data decomposition and load balancing for the stereo match step. Knowledge 
from each step is used to perform load balancing in the next step. The advantages of such 
schemes are as follows. First, these techniques use characteristics of the tasks and the 
data, and therefore, work well no matter how the data changes. Secondly, many 
integrated vision systems consist of such tasks and exhibit the above described computa- 
tion flow, and therefore, these techniques can be used in any system (e.g., object recogni- 
tion, optical flow, etc.) . 

This chapter is organized as follows. Section 6.1 presents the algorithms for each 
step in the motion estimation system and most of the discussion in this section has been 
taken from [45]. These algorithms will provide insight into the involved computations in 
the above system, and any other such system, and provide a framework for the discussion 
in the following sections. Section 6.2 describes the proposed load balancing and data 
decomposition techniques. Section 6.4 presents a parallel implementation of these algo- 
rithms in an integrated environment and discusses the performance results for each of 
these algorithms, data decomposition and load balancing schemes. The underlying mul- 
tiprocessor machine on which we have implemented these algorithms is intel iPSC/2 
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hypercube. Some of these techniques have been applied to other integrated vision sys- 
tems and have been shown to work well [55]. 

6.1. Some Algorithms from Motion Estimation System 

Chapter 2 contains a brief description of the motion estimation system, and its con- 
trol flow is shown in Figure 2.2. This section describes the steps in the motion estimation 
system. A detailed description of the involved computations is included in order to 
understand the characteristics of such algorithms. The motion estimation algorithm con- 
sists of two processes. The first process is feature points extraction. Since the feature 
points used in our algorithm are edge points, we can extract them by locating the zero 
crossings of an image. The second process is matching and has three subprocesses which 
are i) stereo matching, ii) time matching and iii) elimination of multiple matches. The 
basic evidences exploited in these subprocesses to obtain unambiguous matched point 
pairs are the normalized correlation coefficient and the zero crossing patterns [56]. 

6.1.1. Feature points 

The feature points used in this algorithm are zero crossing points of an image which 
are computed using Laplacian-Gaussian masks [50]. In order to eliminate non- significant 
zero crossing points and maintain enough details, we threshold the zero crossing image 
based on the intensity gradient at each zero crossing point. Figure 6.2 depicts the thres- 
holded zero crossing images of the pictures shown in Figure 6.1. 
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Figure 6.2 : Zero Crossings of the Images in Figure 6.1 
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Each zero crossing point is associated with one of the sixteen possible zero crossing 
patterns as suggested in [56]. The similarity between any two zero crossing points is 
based on the directional difference of their zero crossing patterns. The directional differ- 
ence between any two direction values (e.g., D\ and D 2 ) is calculated as follows : 

DIFF = I D\ -D 2 I 
if {DIFF > 4), DIFF =18- DIFF I . 

In the matching process, the use of directional difference (or zero crossing pattern values) 
in finding matched point pairs is through the expression of directional difference weight 
as shown below : 

w ddif= y+diff' 

6.1.2. Matching 

Once zero crossings are extracted in all the involved images, the matching process 
is applied to find point correspondences among the images (two stereo image pairs at two 
consecutive time instants, i. e., r,_i and /,). The evidences used in this process to obtain 
matched point pairs are the normalized correlation coefficient and the directional differ- 
ence weight as mentioned above; furthermore, in order to limit the search space, the 
heuristic of limited displacement or disparity between frames is exploited. The matching 
processes in motion estimation consist of six steps described below. 

1) Perform stereo (from left to right) matching in the r,_! stereo image pair. 

2) Obtain unambiguous matched point pairs by eliminating multiple matches. 
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3) Perform time matching between the unambiguous matched points in the left tj_i 
image and the feature points of the left image. 

4) Obtain unambiguous matched point pairs from the time matched points by eliminat- 
ing multiple time matches. 

5) Perform stereo matching between the unambiguous matched points (obtained in 
step (4)) in the left image and the feature points of the right r, image. 

6) Obtain unambiguous matched point pairs from the results of t t stereo matching by 
eliminating multiple matches. 

The results of the above steps are two sets of unambiguous stereo matched point 
pairs at time instant r,_i and r,-. These two sets are related through Steps (3) and (4), the 
matching over time; therefore, all the unambiguous matched points that correspond to 
each other among the two stereo image pairs at time instants r,_j and t t can be selected. 

6.1.2. 1. Stereo matching 

This is the subprocess to obtain the matched point in the right image for each 
matchable zero crossing point in the corresponding left image of the same stereo pair. 
Since the imaging setup is the parallel axis method, the epi polar line constraint is 
exploited in solving the stereo matching problem. As a result, we have a one- 
dimensional search space instead of a two-dimensional search space in the stereo match- 
ing process. A typical search space in the right image for a matchable zero crossing 
point in the left image is on the left side of the transferred location of that particular left 
image zero crossing point; however, by using the heuristic of limited disparity between 
frames, the search space is limited to (the maximum possible disparity). 
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Let S r i be the set of all non-horizontal zero crossing points in the right image within 
the search space of a zero crossing point in the left image. The stereo matching process 
is as follows : 

For each point in S r i, 

i) Calculate the normalized correlation coefficient with a template size of sxs between 
the grey level images of left and right at the corresponding locations. The normal- 
ized correlation coefficient is calculated by using the following expression : 

y.y. Qxu - lx) ( rxij - rx) 

P. = . : i J . , 7 . = (6-2) 

where 

Ixij : value at point (i,j) in the left image. 

rxij : value at point (ij) in the right image. 

lx : mean value of the template in the left image. 

rx : mean value of the template in the right image. 

ii) If the normalized correlation value p, is less a threshold value thrsh Pi , we discarded 
that particular point in the search space in the remaining steps. 

iii) Calculate the directional difference weight (h ’ddif (suno)) between the left and the 
right zero crossing point (within the search space) according to Equation (6.1). 

iv) Obtain the total weight as the combination of the correlation coefficient and the 
directional difference weight. 
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Wj — d X Pj * 4 - b X tereo)* & ^ b — (6.3) 

v) Among all elements of S rt , the point with the maximum total weight w s is con- 
sidered as the matched point for the corresponding zero crossing point in the left 
image. 

6.I.2.2. Time matching 

This is the subprocess to obtain the matched point in the left r,- image for each candi- 
date zero crossing point in the corresponding left image. Similar to the stereo match- 
ing process, we exploit the heuristic of limited displacement (instead of disparity) 
between frames in solving the time matching problem. We assume that the total motion 
between the ti- j and r, frames is within /pixels in the vertical direction and h pixels (from 
right to left) in the horizontal direction. Hence, the search space for each candidate zero 
crossing point in the left r,_i image is a window of size fxh pixels on the left side of its 
transferred location in the left r, image. Any zero crossing point (except horizontal ones) 
inside this window is a potential match point for the corresponding candidate zero cross- 
ing point in the left j image. The time matching process is similar to the stereo match- 
ing process and is listed as follows : 

For each non- horizontal zero crossing point in the left r, image within the search 
space of a zero crossing point in the left r,_j image, 

i) Calculate the normalized correlation coefficient with a template size of txt between 
the grey level image of r,_i and r, at the corresponding locations. The normalized 
correlation coefficient is calculated by using the following expression : 
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yy. fe,-^)(y,,-5Q 

P,= =+= — = == (6-4) 

•JlStej -*) 2 ■JxZO'.y-T ) 2 

where 

jty : value at point (t,j) in the r,_i image. 

3 >y : value at point in the r,- image. 

I : mean value of the template in the r,_i image, 
y : mean value of the template in the t { image. 

ii) If the normalized correlation coefficient p, is less than a threshold value thrshd Pi , we 
discard that particular point in the remaining steps. 

iii) Calculate the directional difference weight ( w^,^)) between the left f,-_i and the 
left r, zero crossing point (within the search space) according to Equation (6.2). 

iv) Obtain the total weight as the combination of the correlation coefficient and the 
directional difference weight. 

w, = c x p f + d x WMfdimi)'' c + d= 1. (6.5) 

v) Within a search window in the left r, image, the zero crossing point with the max- 
imum total weight w, value is considered as the match point for the corresponding 
zero crossing point in the left r,_i image. 

6.I.2.3. Elimination of multiple matches 

After the subprocess of either stereo matching or time matching, there may be mul- 
tiple matches for some zero crossing points in either the left image or the left r,_! image. 
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We use the same procedure in eliminating both types (stereo or time) of multiple 
matches, except different sizes of the search window are exploited. For stereo matching, 
a 1-D search window of size d max is used on the left side of a multiple matched point; on 
the other hand, for time matching, we use a search window of size 2fxh on the left side of 
a multiple matched point. The remaining steps of the procedure for determining unambi- 
guous matched points are as follows : 

i) At the position of a multiple match in either the right image for stereo matching or 
the left ti image for time matching, open a search window either with a size of d mtx 
or with a size of 2 fxh, respectively. 

ii) Within the search window, locate all the positions that have the same (multiple) 
match. 

iii) If all the positions are within the neighborhood region r nttgh , calculate the total 
weight (w, or w,) according to Equation (6.3) or (6.5) (depending on whether we try 
to eliminate stereo multiple matches or time multiple matches). The position with 
the highest value in its total weight is regarded as the correct match. 

iv) If one or more positions are outside the neighborhood region r^h, we use the 
disambiguation procedure described in the following section to resolve the multiple 
matches. 

v) If multiple matches still exist after the application of the above steps, they all are 


discarded from the match set. 
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Disambiguation 

This procedure is used only if step (iv) in the above discussion is true. In this pro- 
cedure, the neighboring unambiguous matched points around a multiple matched point 
are used as one of the supporting evidences in determining the correct match. The other 
evidences used are the normalized correlation coefficient and the directional difference 
weight. The steps are listed as follows : 

i) At each position, calculate the normalized correlation coefficient (p, for stereo 
matching or p, for time matching) and the directional difference weight (wddif (stereo) 
for stereo matching or w^( (lTO ) for time matching). 

ii) Assign a correlation coefficient rank, R cc , and zero crossing pattern rank, R lcp , to 
each position according to its normalized correlation coefficient (p* or p f ) and direc- 
tional difference weight (wddif (stereo) or w^^)). The position with the highest 
value in normalized correlation coefficient or directional difference weight has the 
highest rank in R^ or R Kp . 

iii) At each position, check for unambiguous matched neighbors. If it has two attached 
unambiguous matched neighbors, a neighbor weight neigh wt of 3 is assigned. On the 
other hand, if it has only one attached unambiguous matched neighbor, a neighbor 
weight neigh wl of 2 is assigned. 

iv) At each position, open a check window of size 5x5 but exclude the center 3x3 region 
and count the number of unambiguous matched points, num smp . 

v) At each position, calculate the total possibility as the sum of the ranks, the weight 


and the number. 
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i rjs^r j<? 

Cf for* ; .. ;: ;T y 


pOSStot Rcc Rzpw + neigh wl + num smp ( 6 . 6 ) 

vi) The position with the highest poss lol value is considered as the correct match. 

Figures 6.3, 6.4 and 6.5 show the results of the algorithm for various steps. Figure 
6.3 shows the unambiguous stereo matching results of the r,_i (r 7 ) stereo image pair. 
Then, using these unambiguous matched points in the left r,_i (r 7 ) image as the candidate 
points, we match them with the feature points of left r, (r 8 ) image by using the time 
matching procedures. The unambiguous time matched points after the elimination of 
multiple matches are shown in Figure 6.4. Having the unambiguous matched points in 
the left 1 8 image, we match them with the feature points of right r g image and obtain the 
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Figure 6.3 : Unambiguous Matched Points of the r 7 Stereo Image 
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Figure 6.4 : Unambiguous Time Matched Points between the Left r 7 and r 8 Images 

unambiguous matched points after the elimination of multiple matches. With these two 
sets of unambiguous stereo matched points at r 7 and t s , we can pick out all the unambi- 
guous matched points that correspond to each other among the four images, and the 
results are depicted in Figure 6.5. 
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Figure 6.5 : Unambiguous Matched Points of the Images in Figure 6.1 
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6.2. Data Decomposition and Load Balancing Techniques 

In a multiprocessor system an obvious and simple method to implement a task in 
parallel is to decompose the data and and the underlying algorithm(s) equally among the 
processors. In a completely deterministic computation in which the computation is 
independent of the input data, such schemes perform well and normally the processing 
time is comparable on all the processors. That is, efficient utilization and load balancing 
can be obtained. For example, regular algorithms such as convolutions, filtering or FFT 
exhibit such properties. The amount of computation to obtain each output point is the 
same across all input data. Therefore, uniform decomposition of data results in load bal- 
anced implementation. 

Most other algorithms do not exhibit the regular structure, and the computation is 
data dependent. Furthermore, the computation is not uniformly distributed across the 
input domain. In such cases, a simple decomposition does not provide efficient mapping 
and results in poor utilization and low speedups. Also, the performance cannot be 
predicted for a given number of processors and data size because the computation varies 
with the type of data and its distribution. For example, in the stereo match algorithm, the 
computation is more where the feature points are dense and is comparatively small where 
the number of features is small and sparsely distributed (Figure 6.1). Hence, uniformly 
partitioning the input data among processors is not expected to provide good speedups 
and utilization. 

In an integrated vision system, it is important to efficiently allocate resources and 
perform load balancing at each step to obtain any significant performance gains overall. 
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An important characteristic of such systems is that the input data of a task is the output of 
the previous task. Therefore, while computing the output in the previous task enough 
knowledge about the data can be obtained to perform efficient scheduling and load 
balancing. In the following we discuss such techniques and in the next section we present 
the performance results for these techniques using algorithms in the motion estimation 
system. 

Consider a parallel implementation of a task on an n processor parallel machine. Let 
7, (\<i<n) denote the computation time at processor node i. Then the overall computation 
time for the task is given by 


Tmtx ~ maxf7’i,...,X )l ^ (6-7) 

The total wasted time (or idle time) T w is given by 

7*w> = 'TjjT m ax ~ 7*i) (6.8) 

i» 1 

If t ibmx - Ti for all i, l<i <ji, then the task will be completely load balanced. Another meas- 
ure of imbalance is given by the variation ratio V, 


V = ^ 22L , Tnun = minfT \ TJ (6.9) 

* min 

The goal in performing load balancing is to minimize T w , or move V as close to 1 as pos- 
sible. In the best case, T w - 0 or = 1. If T stq is the time to execute the same task on a 
sequential machine, then the speedup is given by 



( 6 . 10 ) 


Therefore, by minimizing T w , the achievable speedup can be maximized. 
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6.2.1. Uniform partitioning 

Data decomposition using uniform partitioning performs well as a load balancing 
strategy for input data independent tasks because equally dividing the data distributes the 
computation equally. If the total input data size is D then the total computation time to 
execute the task is T = kxD, where k is a determined by the computation at each input 
data point. For example, in convolution of a image with an mxm kernel, k = 2 xm 2 floating 
point operations. Hence, for an n node multiprocessor the data decomposition methods to 
balance the computation is to make the granule size to 

d, = — (6.11) 

n 

6.2.2. Static 

When the computation is not uniformly distributed across the input domain and is 
data dependent then uniform partitioning does not work well for load balancing. Nor- 
mally, the computation depends on the significant data or the type of data in a partition. 
Many image processing and vision algorithms exhibit this behavior. For example, in 
stereo match, and hough transform the computation is proportional to the number of 
features (edges) or significant pixels in a granule rather than on the granule size. There- 
fore, equal size granules do not guarantee load balanced partitioning because of the data 
dependent nature of the computation. In fact, the variation can be very significant as we 
shall observe in the next section when we discuss the performance. In many such algo- 
rithms, the computation time for a granule (z), 7), is proportional to a certain extent on the 
granule size (fixed overhead to process a granule) and to the number of significant data in 
the granule. That is, 
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T i =Axd i +Bxf i (6.12) 

where, di is the granule size, /; is a measure of significant data in a granule (i), and A and 

B are arbitrary constants which depend on the algorithm. Therefore, the objective is to 

divide the computation among the processors such that each processor receives an equal 

measure of computation to perform rather than an equal amount of data. One way to 

assign a granule to a processor will be to compute the total measure of computation and 

partition as follows: 


Ti = 


i-g 

Y^Axdi + Bxfi 

im 


(6.13) 


where g is the total number of granules in the input domain (Note that the number of 
granules for the current task is n for an n processor system). 


For example, consider computing hough transform of an edge image. The algorithm 
involves computing the parameters for line segments in the images. If there exists a line 
whose normal distance from the origin is r, the normal makes an angle 6 with the x-axis; 
then if the point (x,y) lies on that line, the following Equation is satisfied. 

r =xcosQ +ysinQ 

r and 0 are quantized for desired accuracy and then for each significant pixel (where 
there is an edge), and r is computed for all quantized 0 values. If two partitions of equal 
size contain a different number of edge pixels, then the amount of computation will be 
different for the two partitions despite their being equal in size. In fact, the computation 
is directly proportional to the number of edge pixels in the partition. A way to perform 
static load balancing will be to decompose the input data such that each partition contains 
an equal number of edge pixels. The computation to recognize this pardoning can be 
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performed in the task in which edges are detected by keeping a count of the number of 
edges detected by a processor. Note that it is important to compute the statistics on the 
fly when edges are detected to guarantee low overhead. If the same statics are gathered 
by sequentially scanning the input data then the overhead can be significant. Once the 
task is completed, the data can be reorganized such that the number of edges with each 


Z Z 

processor is in the interval (— - 8 , — + 8), where Z a is the total number of edges 

n n 

detected in the image and 8 is determined by the minimum granule size from fixed over- 
head considerations. 


6.2 J. Weighted static 

When the computation in a granule not only depends on the number of significant 
data points in the input domain but also depends on their spatial relationships, then data 
distribution also needs to be taken into account as a measure of load to perform load 
balancing. For example, from the previous section it is evident that to perform stereo 
match, not only does the computation depend on the number of zero crossings but also 
depends on their spatial distribution. If the zero crossings are densely spaced then the 
computation will be more than that if the same number of zero crossings is sparsely dis- 
tributed (refer to Figure 6.2). The reason is that if the zero crossings are densely packed 
then greater numbers of zero crossings need to be matched with each corresponding zero 
crossing in the other image, whereas fewer numbers of zero crossings need to be matched 
if they are sparsely distributed. Hence, the computation also depends on the spatial den- 
sity (such as features/row if one-dimensional matching is performed). That is, 
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Ti =Axdi +B XH>iXdi (6. 1 4) 

where w; is the feature dependent spatial density. For example, if the minimum granule 

size is a row of the input data then w, = rf , where r, is the number of features in row i and 

p is a parameter, 0<P^1. p=0 means that the computation is independent of how the 

features are distributed within a row. Therefore, to divide the computation equally among 

n processors, the following heuristic can be used. 


Ti = 


Axdi + BxwiXdi 

i-O 


(6.15) 


where, R is the number of rows in the image. Note that the above heuristics approximate 
the load and do not exactly divide the computation among processors. However, in the 
next section we will show that these schemes perform well. 

As an example, consider the stereo match computation. While partitioning the data 
among processors, a weight can be assigned to each row as a function of the number of 
features in the row. This weight represents the feature density. Note that using a row as 
the smallest granule avoids the communication overhead because search space for stereo 
matching is one-dimensional, and therefore, if the granule boundary is one row then there 
is no need for communication. 


6.2.4. Dynamic 

The above three methods use the knowledge about the data when it is produced to 
perform load balancing for the next task. However, once decomposition is done then the 
data is not reshuffled. Therefore, we consider the above methods as knowledged based 
static load balancing schemes. In the dynamic scheme, the data is decomposed into finer 
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granules such that the number of tasks (that is, the number of independent granules) M is 
much larger than the number of processors. 

At the execution time, the processors are assigned these tasks dynamically by a 
designated scheduler from a task queue which contains these tasks. Processors are 
assigned new tasks as they finish their assigned tasks, if there are more tasks left to be 
assigned. However, the knowledge obtained from the previous step again can be used to 
anticipate the completion of a task to assign a new task to a processor. That is, the tasks 
can be pipelined, and therefore, the overhead of the dynamic load balancing can be 
reduced. The communication overhead of dynamically assigning tasks is not incurred in 
the previous three schemes. Figure 6.6 shows the partitioning for the above described 
strategies for the stereo match algorithm. 

6.3. Parallel Implementation and Performance Evaluation 

This section presents a parallel implementation of the algorithms that are part of the 
motion estimation system and describes the performance of the algorithms and load 
balancing strategies. The algorithms were implemented and evaluated on a hypercube 
multiprocessor. 

6.3.1. Hypercube multiprocessor 

Chapter 1 contains a brief description of hypercube multiprocessor architecture. A 
typical commercially available hypercube multiprocessor system consists of a host pro- 
cessor and node processors. The host processor serves as the cube manager, provides 
interface with the external environment and input-output of data and program. We used 
Intel iPSC/2 hypercube multiprocessor consisting of 16 nodes. Each node consists of an 
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Figure 6.6 : Load Balancing Strategies 


Intel 80386 processor, Intel 80387 co-processor, 4 megabyte memory and a communica- 
tion module. 


6.3 2 . Feature extraction 

Features used for stereo match algorithms are the zero crossings of the convolution 
of the image with Laplacian, as presented in Section 6.1. Zero crossing computation 
involves 2-D convolution and extraction of zero crossings from the convolved image. 
Since convolution is a data independent algorithm, uniform partitioning is sufficient to 
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evenly distribute the computation. The mapping is a division of NxN image onto P pro- 
cessors. Each processor computes the zero crossings of a share of N 2 lP pixels (Equation 
6.11). Data division onto the processors is done along the rows. This mapping reduces 
communication to only one direction. The reason is as follows. 2-D convolution can be 
broken into two 1-D convolutions [50]. This not only reduces the computation from W 2 
sum of products operations per pixel to 2xW sum of product operations per pixel (W is 
the convolution mask window size) but also reduces the communication requirements in 
a parallel implementation if the data partitioning is done along the rows. There is no need 
for communication when convolution is performed along the rows. 

Table 6. 1 shows the performance results for the above implementation for an image 
of size 256x256 and a convolution window of size 20x20. The first column shows the 
number of processors in the cube( P). The second column represents the total processing 
time (tproc) for convolution. Column 3 shows the number of bytes communicated by a 
processor to the neighboring processor, and column 4 shows the corresponding commun- 
ication time which is small compared to the computation time. The second half of the 
table shows computation time for extracting zero crossings from the convolved image. 
The corresponding speedups are also shown. 

It can be observed that almost linear speedup is obtained for convolution. The two 
factors which contribute towards this result is that communication overhead is relatively 
small and is constant as the number of processors increases. However, the speedup 
obtained in the elapsed time (which includes the program and data load time also) is sub- 
linear due to the following reason. The hypercube multiprocessor’s host does not have a 
broadcast capability, and therefore, the overhead of loading the program increases 



154 


Table 6.1 : Performance for Feature Extraction (Zero Crossings) 



Computation for Convolution and Zero Crossings 


Convolution Window Size = 20x20 

No. Proc. 

Conv. 

Comp. 

Time(sec.) 

Conv. 

Comm. 

Bytes 

Conv. 

Comm. 

Time(ms.) 

Conv. 

Total 

Time(sec.) 

Conv. 
Speed Up 

ZC 

Comp. 

Time(sec.) 

ZC 

Speed Up 

1 

109.0 

0 

0 

109.0 

1 

6.47 

1 

2 

54.76 

2816 

13 

54.78 

1.98 

3.23 

1.99 

4 

2731 

5632 

36 

27.55 

3.95 

1.66 

3.89 

8 

13.88 

5632 

36 

13.92 

7.83 

0.85 

7.60 

16 

7.07 

5632 

36 

7.11 

15.33 

0.42 

15.25 


Feature Extraction Performance (Elapsed Time) 

No. Proc. 

Elapsed 

Time(sec.) 

Speed up 

1 

116.2 

1 

2 

58.8 

1.97 

4 

30.1 

3.86 

8 

16.1 

7.22 

16 

9.6 

12.1 


linearly with the number of processors. However, data load time increment with the 
increase in the number of processors is comparatively small because amount of data to be 
loaded to one processor decreases as the number of processors increase. The only incre- 
ment in data load time results from the number of communication setups from the host to 
the node processors which increases linearly with the number of processors. 

6.3 .3. Matching features 

This task involves matching features in stereo pairs of images. As discussed in Sec- 
tion 6.1, the epipolar constraint limits the search for a match in the corresponding image 
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to only in horizontal direction, i.e., along the rows in the zero crossings of the image. 
Thus data partitioning along the rows for parallel implementation results in no communi- 
cation between node processors as long as each partition contains an integral number of 
rows. 

The computation involved in the stereo matching algorithm is data dependent and 
varies across the image because it depends on the number of zero crossings, distribution 
of zero crossing across the image and distribution of zero crossings along the epipolar 
lines. Therefore, partioning the data uniformly among the processors (i.e., assign each 
processor an equal number of rows) may not yield expected speedups and processor utili- 
zation. A processor which has very few zero crossings and sparsely distributed zero 
crossings will be under utilized whereas a processor with a large number of zero cross- 
ings and densely distributed zero crossings will become a bottleneck, and this imbalance 
of load will result in a poor performance. 

We used uniform partitioning, static load balancing, weighted static and dynamic 
load balancing schemes to decompose computation on the multiprocessor. Static load 
balancing can be achieved by keeping a count of the zero crossings with each processor 
when the previous task (convolution and feature extraction) is executed. At the comple- 
tion of the task, the data is reorganized, using this information and the techniques 
described in the previous section. 

Figure 6.7 shows the distribution of the computation times for an 8-processor case. 
The X-axis shows the processor number and the Y-axis shows the computation time for 
each scheme. As we can observe, uniform partitioning does not perform well because the 
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Processors 

Figure 6.7 : Distribution of Computation Times for Stereo Match (P=8) 

variation in computation time is tremendous, and therefore, performance gains are 
minimal. The static load balancing scheme (shown as dashed bars) performs much better 
than uniform partitioning, but the variation in the computation times is still significant 
because the computation also depends on the distribution of zero crossings. The 
weighted static scheme does perform better than static and further reduces the variation 
in computation times. Note that these schemes only measure the load approximately, and 
therefore, will not divide the computation exactly uniformly. Furthermore, minimum 
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granularity is a row boundary in order to avoid communication between processors. 
Finally, for an 8-processor case, the dynamic scheme performs very well. Table 6.2 sum- 
marizes the distribution for the 8-processor case. The table shows the computation time 
for each processor for all four methods. Speedup is computed as follows. If T s is the 
sequential processing time and T mix is the maximum processing time of one processor 

T 

among n processors, then speedup is — - — . Variation ratio is the riltio of the maximum 

processing time to the minimum processing time and it provides a measure of imbalance 
in the computation. For example, in Table 6.2 the variation ratio is 44.25 for the case of 
uniform partitioning, 2.71 for the case of static load balancing, 1.50 for weighted static 


Table 6.2 : Distribution of Computation Times for Stereo Match 
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44.25 
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1.09 
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2.19 
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and 1.09 for dynamic load balancing. Improvement ratio is the ratio of speedup obtained 
with load balancing to that of uniform partitioning. The computation times shown in 
these tables include all the overhead of load balancing schemes. Figure 6.8 shows the 
speedup graph for varying sizes of multiprocessors from 1 processor to 16. As we can 
observe, uniform partitioning does not provide any significant gains in speedup as the 
number of processors increases. Dynamic scheme performs the best among all the 
schemes (at least for small processor size) but the two static schemes perform compar- 
able to the dynamic scheme. We believe that as the number of processors is increased. 



Figure 6.8 : Speedups for Stereo Match Computation 
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the two static schemes will move even closer to the dynamic scheme or even perform 
better than the dynamic scheme because for larger multiprocessors, the overhead of the 
dynamic scheme will be larger. One important conclusion from the above observations 
is that such a knowledge based scheme performs very well to schedule parallel tasks in 
an integrated vision system in which very often similar bottom up computations are per- 
formed in a sequence. 

6,3.4. Time match 

The computation in the time match algorithm is similar to that in stereo match 
except the search space is two-dimensional and the input to the algorithm is the stereo 
match output. Another difference is that the number of significant points in the input 
data is much smaller than that in stereo match because a great deal of input points get 
eliminated in stereo match. Table 6.3 shows the distribution of the computation times for 
the 16 processor case. We only present uniform partitioning and static load balancing 
cases. The most important observation is that uniform partitioning performs worse than 
that in the case of stereo match and static load balancing performs better. 

The table shows how the measure of computation (number of zero crossings left 
from stereo match step) gets divided among the processors in the two cases. It is clear 
that the number of zero crossings is very evenly distributed (within the minimum granule 
of one row constraint) in the static case whereas they are lumped with a few processors in 
the uniform partitioning case. Figure 6.9 shows the speedup graphs for the two schemes 
for a range of multiprocessor size. The speedup gains for the load balanced case is very 
significant over the uniform partitioning case. We computed the overhead of performing 
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Table 6.3 : Distribution of Computation Time for Time Match 


Computation for Time Match ( Proc. = 16) 
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Improvement 
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Uniform 


Balanced 15.72 


2.76 12.63 


knowledge based static load balancing and the overhead was 3 ms., which is negligible 
compared to the computation time, and the performance gains are significant. 
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Figure 6.9 : Speedup for Time Match 
6.3.5. Second stereo match 

This step involves stereo match computation for features from images at time 
instant r i+t after time point correspondence is established between images at time t, and 
r i+1 . The matching is similar to that in the first stereo match except that it needs to be 
done only at those points at which time correspondence has already been established. 
Consequently, the number of features to be matched is much less than that in the first 
computation, and hence, the importance of load balancing is further increased. Figure 
6.10 depicts the distribution of computation times for the second stereo match step. The 
three load balancing algorithms used in this case are Uniform Partitioning, Static and 
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Dynamic. As it is observed from the Figure, the uniform partitioning does not perform 
well at all compared to the other two schemes. The variation in computation time is 
significant. Furthermore, it is observed that static and dynamic schemes perform compar- 
ably. 

Figure 6.11 presents the speedups for the same algorithm for various multiprocessor 
"sizes. The Figure shows that the gains from these load balancing schemes are very 


Time 

(sec.) 



Processors 


Figure 6.10 : Distribution of Computation Times for Second Stereo Match (P=8) 
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significant over simple uniform partitioning. One important observation can be made by 
comparing results in Figure 6.8 and 6.11. Note that the performance of uniform parti- 
tioning in the second stereo match is much worse than that in the first stereo match. For 
example, for the 16 processor case, the speedup in the first case is 5.55, whereas for the 
same multiprocessor size speedup is only approximately 2.3. Therefore, as the computa- 
tion progresses in an integrated environment, the gains of these load balancing schemes 
become increasingly significant. Hence, overall gains for the entire system are better than 
what may be expected. 



Figure 6.1 1 : Speedups for Second Stereo Match 
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6.3.6. Summary 

In summary, the following important observations can be made from all the results 
presented in the previous sections. First, the improvement in performance (such as utili- 
zation and speedup) itself increases using the load balancing schemes as the number of 
processors increases. Therefore, performance gains are expected to be higher for larger 
multiprocessors. Secondly, in an integrated environment, the overheads of such methods 
are small because the measure of loads can be computed on the fly as a side result of the 
current task. Finally, though we showed the performance results of the implementation 
on the hypercube multiprocessor, these methods can be applied when algorithms are 
mapped on any medium to large grain multiprocessor system because these techniques 
are independent of the underlying multiprocessor architecture. 

Consider the overall performance gains for the entire system. As the computation 
progresses from one step to the next, uniform partitioning performs worse because the 
data points reduce, but the computation at each point increases. Therefore, the gains of 
using parallel processing are minimal. However, the load balancing techniques recognize 
the data distribution at each step and data is decomposed using the distribution. There- 
fore, performance gains are expected to improve as the computation progresses in an 
integrated systems environment For example, consider zero crossing, stereo match and 
time match and second stereo match steps. In zero crossing computation, uniform parti- 
tioning performs well and the load is balanced. Hence, the improvement ratio is 1. For 
stereo match the improvement of static over uniform partitioning is 2.15 for 8 processor 
case, and is 2.22 for the 16 processor case. Similarly, for the time match step, the 
improvement of static load balancing for 8 processor case is 3.38 and for the 16 
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processor case it is 4.2. Furthermore, for the second stereo match step, similar results are 
obtained. Therefore, it is observed that the improvement in performance itself increases 
as the number of processors increases as well as when the computation progresses in an 
integrated vision system. In summary, the performance gains are expected from these 
schemes for data decomposition and load balancing schemes as the number of processors 
increases, as the computation progresses in an integrated environment, and the overhead 
of these schemes is negligible compared to the performance gains. 
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CHAPTER 7. 
CONCLUSIONS 


7.1. Summary and Discussion 

This thesis has addressed several issues in multiprocessor architectures and parallel 
algorithms for Integrated Vision Systems. The approach has been to consider computa- 
tional requirements for vision applications in an integrated environment in designing a 
multiprocessor architecture rather than to propose architecture solutions to perform indi- 
vidual algorithms efficiently. An IVS involves algorithms from several levels of process- 
ing and the characteristics of algorithms in each level differ tremendously from algo- 
rithms in other levels. However, these algorithms need to exist in a system simultane- 
ously and interact with each other. Therefore, a multiprocessor architecture suitable for 
IVS applications must be partitionable and reconfigurable. It must have the capability to 
allocate resources dynamically, provide for dynamic load balancing and task scheduling, 
provide fast and flexible communication, provide efficient I/O, and be fault-tolerant, in 
addition to providing raw processing power. 

Chapter 2 presented a model of computation for IVS. The model captures the com- 
putation requirements in an IVS, spatial as well as temporal data dependencies, and sug- 
gests what types of parallelism may be available in tasks of an IVS. Using the model 
desired features and capabilities of an architecture suitable for IVSs are identified. 
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Another important aspect of the model is that it incorporates temporal relationships 
between tasks which are absent in the Image Understanding Benchmarks presented in 
[ 1 ]. 

Architecture of NETRA has been presented in Chapter 3. The original form of 
NETRA was proposed by Sharma, Patel and Ahuja in [3]. Several refinements to the 
architecture have been presented in this thesis after careful and detailed considerations of 
the computational requirements of an IVS. The modifications include alternative inter- 
cluster communication strategies and the synchronization bus in a cluster. The architec- 
ture was critiqued in the light of computation requirements for IVSs developed in 
Chapter 2. It was argued that the architecture provides most of the features, such as 
reconfigurability, partitionability, flexible communication, and fast I/O, needed in a mul- 
tiprocessor of IVS. Furthermore, a discussion on how to provide these capabilities was 
included. 

In Chapter 4 the performance of various algorithms on a processor cluster was 
presented. The evaluation of a cluster using several algorithms indicates that the cluster 
provides flexibility of communication, ability to reconfigure in SIMD, MIMD and sys- 
tolic modes, and shown that almost linear speedups are possible in most cases. The most 
important observation is that the programmable crossbar design reduces the overheads of 
mapping parallelism by providing selective broadcast capability and the ability to pro- 
vide the best interconnection for a particular algorithm. Both analytical and implementa- 
tion results were presented. Performance evaluation of some algorithms from the Image 
Understanding Benchmark were also presented. 
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Chapter 5 presented alternative inter-cluster communication strategies in NETRA 
and evaluation of parallel algorithms when mapped across multiple clusters. When an 
algorithm is mapped on multiple clusters, processors between different clusters need to 
communicate. This requires accessing the global interconnection network and global 
memory. However, conflicts may occur in accessing global interconnection and global 
memory, which in turn, affects the performance of an algorithm. Presented in the first 
part are two inter-cluster communication strategies, viz; using global memory via multis- 
tage network and a high speed bus connecting the clusters together. An analysis was 
made to compute delays through the global network due to conflicts so that their effects 
can be incorporated into performance evaluation of algorithms. In the second pan, per- 
formance of several algorithms, when mapped across multiple clusters, is presented. The 
results indicate that even in the case of a large number of conflicts, good (almost linear 
speedups) performance can be obtained for several algorithms when a multistage net- 
work is used. However, in order for the bus to be a viable global interconnection, the bus 
bandwidth and speed must be much greater than the processor speed. 

Data decomposition and load balancing techniques were presented in Chapter 6. In 
order to obtain any significant performance gains from parallel implementation of inter- 
mediate and low level algorithms, efficient load balancing is important because the com- 
putation is normally data dependent. The main contributions have been to present tech- 
niques to perform data decomposition and load balancing schemes that exploit 
knowledge about the computation and the data in a task. Since in an IVS such knowledge 
for the next task is normally available while performing the current task, the overheads 
are minimal. Four techniques presented are uniform partitioning, which is shown to be 
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good enough for data independent algorithms: static and weighted static, which are 
shown to perform well when computation is dependent on the amount of significant data 
and its distribution; and the dynamic load balancing which is shown to work as well. 
However, in the dynamic load balancing scheme, the spatial relationships between data 
element are not maintained. Performances of all the techniques have been shown using 
algorithms from a motion estimation system, and it is concluded that these schemes per- 
form well and provide tremendous improvements in utilization and speedups. 

7.2. Future Work 

Future work relating to this thesis can be put into three categories, namely, architec- 
tural issues, parallel algorithms issues and systems issues. The following is a brief discus- 
sion on each of these issues. 

Specification and more detailed design of the communication networks in NETRA 
are areas for an extension to the current work. This involves design of communication 
protocols for both intra-cluster and inter-cluster communications. For example, the 
design of communication protocols should address issues such as where does the respon- 
sibility of programming the crossbar lie, or how the crossbar will be addressed. Once the 
specifications are provided, the crossbar can be designed in detail. Another task is to 
develop a versatile simulator for NETRA which incorporates clusters as well as other 
parts of NETRA in detail. The simulator can be used to evaluate algorithms in detail, and 
using the results of the evaluation, refinements in the architecture can be performed. 

The second avenue for future research is in the area of parallel algorithm issues, 
both specific to implementing them on NETRA as well as on other architectures. The 
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foremost task is to map a variety of algorithms to evaluate the architecture as well as to 
develop a general approach to mapping parallel algorithms. Study and evaluation of algo- 
rithms, especially intermediate and high level, are necessary to obtain a better under- 
standing of their characteristics from parallel implementation perspective. Furthermore, 
if the algorithms are evaluated using existing parallel machines, then a better understand- 
ing obtained for architectural issues as well as knowledge can be gained about overheads 
associated with implementation of such algorithms. 

The third issue, termed systems issue, deals with the design of operating systems 
and development of a programming environment for NETRA. The design should specify 
tasks for various elements of the architecture, how the tasks interact with one another, 
how to specify and incorporate data and knowledge base about vision systems, design of 
protocols to use data, and knowledge base. Furthermore, the systems issues include 
designing protocols for load balancing, partitioning and allocating resources, memory 
management and task management. The most important and ultimate task is to integrate 
all of the above into one system. 
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